
feat: trigger Databricks GCS→bronze sync after file validation (Edvise/Legacy)#239

Merged
chapmanhk merged 7 commits into develop from feature/gcs-bronze-sync-databricks-api
May 29, 2026

Conversation

@chapmanhk chapmanhk commented May 18, 2026

Collaborator

feat(data): trigger Databricks GCS→bronze sync after file validation (Edvise/Legacy)

Description

After a successful file validation (POST .../input/validate-upload/{file_name} or SFTP validate path), the API starts the Databricks job edvise_validated_gcs_to_bronze_sync to copy the object from GCS validated/ into the institution's bronze volume (gcs_uploads). Validation and batch creation are unchanged; Databricks trigger failures are logged and do not fail the validation response.

The API waits only for Databricks jobs.run_now to accept the run and return a run id. It does not wait for cluster startup or the file copy to finish.
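A minimal sketch of that submit-and-return behavior, not the code in src/webapp/databricks.py. It assumes a jobs client whose run_now returns a handle exposing run_id as soon as the run is accepted; FakeJobsClient, FakeRunHandle, trigger_bronze_sync, and the ids 42 and 123 are hypothetical. The parameter name include_blob_paths_json comes from the commit notes; encoding it as a JSON list of blob paths is an assumption.

```python
import json
from dataclasses import dataclass


@dataclass
class FakeRunHandle:
    """Hypothetical stand-in for what run_now returns once a run is accepted."""

    run_id: int


class FakeJobsClient:
    """Hypothetical stand-in for a Databricks jobs client (assumption)."""

    def __init__(self) -> None:
        self.calls: list[dict] = []

    def run_now(self, job_id: int, job_parameters: dict) -> FakeRunHandle:
        # Record the submission and pretend Databricks accepted it.
        self.calls.append({"job_id": job_id, "job_parameters": job_parameters})
        return FakeRunHandle(run_id=123)


def trigger_bronze_sync(jobs, job_id: int, file_name: str) -> int:
    """Submit the run and return its id without waiting for completion."""
    # Assumption: the job takes a JSON-encoded list of blob paths.
    params = {"include_blob_paths_json": json.dumps([f"validated/{file_name}"])}
    handle = jobs.run_now(job_id=job_id, job_parameters=params)
    # No wait on the handle: cluster startup and the file copy continue on
    # Databricks after the validation response has been returned.
    return handle.run_id


jobs = FakeJobsClient()
run_id = trigger_bronze_sync(jobs, job_id=42, file_name="students.csv")
```

The returned run id is what ends up in the databricks_job_run_id log field.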

Behavior

  • Runs only for institutions with edvise_id or legacy_id (PDP-only institutions are skipped).
  • Uses existing Databricks auth (DATABRICKS_HOST_URL, GCP service account).
  • Resolves the job by optional DATABRICKS_VALIDATED_BRONZE_SYNC_JOB_ID, exact job name, DEV/STAGING job ID mapping, or a unique bundle-prefixed job name.
  • Structured JSON trace logs: validation_request, gcs_bronze_sync_background_start, gcs_bronze_sync_background_done with outcome (success | trigger_failed | skipped) and correlation_id for cross-log lookup.
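The job-resolution order in the list above can be sketched roughly as follows. This is an illustration under assumptions, not the shipped resolver: resolve_job_id, the (name, id) tuple shape, the env_job_ids mapping, and all ids are hypothetical, and a bundle-prefixed name is assumed to look like "[dev dev_cloudrun_sa] edvise_validated_gcs_to_bronze_sync".

```python
JOB_NAME = "edvise_validated_gcs_to_bronze_sync"


def resolve_job_id(env: dict, jobs: list, environment: str, env_job_ids: dict) -> int:
    """Resolve the sync job id using the fallback order described in the PR."""
    # 1. An explicit override always wins.
    override = env.get("DATABRICKS_VALIDATED_BRONZE_SYNC_JOB_ID", "").strip()
    if override:
        return int(override)

    # 2. Exact job name; duplicates are treated as an error.
    exact = [job_id for name, job_id in jobs if name == JOB_NAME]
    if len(exact) == 1:
        return exact[0]
    if len(exact) > 1:
        raise ValueError(f"multiple jobs named {JOB_NAME!r}")

    # 3. Known deployed ids per environment (DEV/STAGING in the PR).
    if environment in env_job_ids:
        return env_job_ids[environment]

    # 4. A bundle-prefixed name, accepted only when it is unique.
    prefixed = [job_id for name, job_id in jobs if name.endswith(f"] {JOB_NAME}")]
    if len(prefixed) == 1:
        return prefixed[0]
    raise ValueError(f"could not resolve a unique job for {JOB_NAME!r}")


jobs = [("[dev dev_cloudrun_sa] " + JOB_NAME, 7), ("other_job", 8)]
by_override = resolve_job_id(
    {"DATABRICKS_VALIDATED_BRONZE_SYNC_JOB_ID": "99"}, jobs, "LOCAL", {}
)
by_mapping = resolve_job_id({}, jobs, "DEV", {"DEV": 11})
by_prefix = resolve_job_id({}, jobs, "LOCAL", {})
```

Raising on ambiguity rather than guessing keeps a misnamed deployment visible in the trigger_failed logs.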

New / updated

  • src/webapp/databricks.py: run_validated_gcs_to_bronze_sync, job resolution, bundle-aligned job parameters
  • src/webapp/routers/data.py: validation-time Databricks trigger in validation_helper
  • src/webapp/databricks_test.py, src/webapp/routers/data_test.py
  • src/webapp/.env.example: documents optional DATABRICKS_VALIDATED_BRONZE_SYNC_JOB_ID

Kill switch: ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION=false (default: enabled).
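A rough sketch of how the kill switch and the institution check might combine into a skip decision. The function name, the skip-reason strings, and the set of values treated as "off" are assumptions for illustration; only the variable name, its default, and the edvise_id / legacy_id rule come from this PR.

```python
from typing import Optional


def bronze_sync_skip_reason(
    edvise_id: Optional[str], legacy_id: Optional[str], env: dict
) -> Optional[str]:
    """Return why the sync is skipped, or None when it should run."""
    # Default is enabled: only an explicit "off" value disables the trigger.
    flag = env.get("ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION", "true").strip().lower()
    if flag in {"false", "0", "no"}:  # accepted spellings are an assumption
        return "disabled_by_env"
    # PDP-only institutions have neither id set and are skipped.
    if not (edvise_id or legacy_id):
        return "pdp_only_institution"
    return None


runs_for_legacy = bronze_sync_skip_reason(None, "legacy-1", {})
skipped_pdp = bronze_sync_skip_reason(None, None, {})
skipped_env = bronze_sync_skip_reason(
    "edvise-1", None, {"ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION": "false"}
)
```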

Deployment Readiness*

Testing

Describe or check:

  • Created or updated unit, feature, and/or integration tests
  • Typical manual testing in the local env browser, dev pipeline, etc.

Automated: databricks_test.py (job ID resolution, DEV/STAGING mapping, bundle-prefixed job name resolution, run_now params contract); data_test.py (Edvise/Legacy trigger paths, PDP-only skip, env disabled, non-fatal Databricks trigger failure).

Manual (dev): Deployed feature branch to dev and validated upload as Legacy institution. Confirmed validation succeeded, the API selected the DEV Databricks job id, and the bronze sync job was triggered successfully. Verified expected logs include outcome:"success" and databricks_job_run_id; corresponding run is visible in Databricks Workflows.

Deployment Notes

Describe or check:

  • No special deployment steps required
  • Special deployment steps required

Rollback Plan

Describe or check:

  • Standard revert is sufficient (git revert)

Revert the merge commit. Optionally set ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION=false immediately if a hot disable is needed before revert ships. Validation and existing Databricks flows are unaffected.

Reviewer Guidance / Questions*

  • Job parameters are pinned to the edvise bundle contract (github_validated_bronze_sync.yml); changes there need a matching API update.
  • This intentionally triggers Databricks during the validation request, but only waits for run_now to submit the job. It does not wait for the copy itself.
  • Databricks trigger errors are non-fatal to validation and are logged as outcome:"trigger_failed".
  • Job resolution includes DEV/STAGING deployed job IDs to handle current Databricks bundle naming differences.
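To illustrate the non-fatal contract in the points above, here is a small sketch assuming the trigger is passed in as a callable. TriggerError is a hypothetical stand-in for the SDK's DatabricksError, and the error field and function name are not taken from the real code.

```python
import json
import logging

logger = logging.getLogger("bronze_sync_sketch")


class TriggerError(RuntimeError):
    """Hypothetical stand-in for the SDK's DatabricksError."""


def run_trigger_non_fatally(trigger, correlation_id: str) -> dict:
    """Call the trigger; log the outcome and never raise to the caller."""
    base = {"event": "gcs_bronze_sync_background_done", "correlation_id": correlation_id}
    try:
        run_id = trigger()
    except (ValueError, TriggerError) as exc:
        # Validation has already succeeded, so the failure is only logged.
        result = {**base, "outcome": "trigger_failed", "error": str(exc)}
        logger.warning(json.dumps(result))
        return result
    result = {**base, "outcome": "success", "databricks_job_run_id": run_id}
    logger.info(json.dumps(result))
    return result


def failing_trigger() -> int:
    raise TriggerError("job not found")


failed = run_trigger_non_fatally(failing_trigger, "abc")
succeeded = run_trigger_non_fatally(lambda: 123, "abc")
```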

Screenshots / Testing Evidence*

Expected success log:

{"event":"validation_request","correlation_id":"...","inst_id":"...","bucket":"...","file_name":"...","validation_source":"MANUAL_UPLOAD"}
{"event":"gcs_bronze_sync_background_start","correlation_id":"...","inst_id":"...","bucket":"...","file_name":"..."}
{"event":"gcs_bronze_sync_background_done","correlation_id":"...","outcome":"success","validated_blob_path":"validated/<file_name>","databricks_job_run_id":123,"databricks_job_name":"edvise_validated_gcs_to_bronze_sync"}

Databricks: corresponding run visible under Workflows → Jobs for [dev dev_cloudrun_sa] edvise_validated_gcs_to_bronze_sync.
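For reference, trace lines in the shape shown above could be produced by a helper along these lines. This is a sketch only: trace_event and the logger name are hypothetical, and the real logging helpers in data.py may be structured differently.

```python
import json
import logging

logger = logging.getLogger("bronze_sync_trace_sketch")


def trace_event(event: str, correlation_id: str, **fields) -> str:
    """Emit one compact JSON log line and return it for inspection."""
    payload = {"event": event, "correlation_id": correlation_id, **fields}
    # Compact separators keep each event on a single, grep-friendly line.
    line = json.dumps(payload, separators=(",", ":"))
    logger.info(line)
    return line


line = trace_event(
    "gcs_bronze_sync_background_done",
    "abc",
    outcome="success",
    databricks_job_run_id=123,
)
```

Sharing one correlation_id across validation_request, background start, and background done is what makes the cross-log lookup possible.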

SOC 2 Change Management Checklist

  • None of the below are true in this code
  • New roles/permissions are introduced without review and approval by the product manager
  • Hardcoded credentials, secrets, or API keys are present in this code
  • Secrets are being managed outside of the approved secrets management process (e.g., GitHub Secrets, environment variables)
  • PII or sensitive data handling is introduced or changed without being reviewed against our data classification policy
  • Sensitive data is written to logs
  • Input validation and sanitization is missing
  • An unnecessary attack surface has been introduced (e.g., unused endpoints, open ports, debug modes left enabled)
  • Common vulnerabilities have been introduced in the code (inc. any dependencies added or updated)
  • No review for common vulnerabilities has been conducted
  • Not tested in a non-production environment
  • Breaking changes to existing APIs or integrations without downstream consumers being notified
  • Performance impact has not been considered or is not acceptable
  • Appropriate audit logging is missing for any security-relevant actions introduced by this change
  • Log entries contain sensitive or PII data
  • All existing tests do not pass locally (./vendor/bin/pest)

Provide justification if you are submitting a PR with any boxes checked other than the first.


Reminder for Reviewers: By approving this PR you are confirming that you have reviewed the code for correctness, security, and compliance with our engineering and SOC 2 standards. Do not approve PRs where SOC 2 checklist items are checked without documented justification.


chapmanhk and others added 4 commits April 28, 2026 15:09
- Add run_validated_gcs_to_bronze_sync and job edvise_validated_gcs_to_bronze_sync
  with include_blob_paths_json for validated/{file_name}.
- Call after successful validate-upload / validate-sftp when edvise_id or legacy_id
  is set; ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION (default true) to disable.
- Failures to start the job are logged and do not fail validation.
- Extend data tests with DatabricksControl mock and assertions.

Made-with: Cursor
…lution

Schedule GCS-to-bronze Databricks run_now in BackgroundTasks after validated/
writes. Add correlation_id and JSON trace logs (validation_request, background
start/done). Optional DATABRICKS_VALIDATED_BRONZE_SYNC_JOB_ID; resolve job by
name with duplicate detection when unset. Refine skip reasons for PDP vs
Edvise/Legacy.

Co-authored-by: Cursor <cursoragent@cursor.com>
Extract Databricks helpers and job-parameter constants, use specific
exceptions (ValueError, DatabricksError), and split background logging
into focused functions under 50 lines. Add tests for PDP-only and env
kill-switch skips plus run_now parameter contract coverage.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@chapmanhk chapmanhk changed the title from "Feature/gcs bronze sync databricks api" to "feat: trigger Databricks GCS→bronze sync after file validation (Edvise/Legacy)" May 27, 2026
chapmanhk and others added 3 commits May 27, 2026 20:36
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@chapmanhk chapmanhk marked this pull request as ready for review May 28, 2026 05:27
@chapmanhk chapmanhk requested a review from vishpillai123 May 28, 2026 05:28

@vishpillai123 vishpillai123 left a comment

Collaborator

nice error handling! Looks good

vishpillai123 commented May 28, 2026

Collaborator

@chapmanhk has this been tested on dev webapp yet or still needs to be tested?

@chapmanhk

Collaborator Author

> @chapmanhk has this been tested on dev webapp yet or still needs to be tested?

It's been tested on the webapp!

@chapmanhk chapmanhk merged commit 7066397 into develop May 29, 2026
6 checks passed
@chapmanhk chapmanhk deleted the feature/gcs-bronze-sync-databricks-api branch May 29, 2026 15:37
vishpillai123 added a commit that referenced this pull request Jun 17, 2026
* docs: inherit org community health files (#237)

* docs: remove local community health files to inherit from org-wide .github repo

* docs: update README to include previous contributing info

* feat(api): simplify create model request to name only (#238)

* chore: bump edvise dependency to 1.0.0 (#241)

Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>

* chore: creating dummy changelog.md file while we create semver / gitflow process

* feat: trigger Databricks GCS→bronze sync after file validation (Edvise/Legacy) (#239)

* Trigger GCS→bronze Databricks sync after validation (Edvise/Legacy)

- Add run_validated_gcs_to_bronze_sync and job edvise_validated_gcs_to_bronze_sync
  with include_blob_paths_json for validated/{file_name}.
- Call after successful validate-upload / validate-sftp when edvise_id or legacy_id
  is set; ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION (default true) to disable.
- Failures to start the job are logged and do not fail validation.
- Extend data tests with DatabricksControl mock and assertions.

Made-with: Cursor

* feat(data): bronze sync after validation with tracing and job id resolution

Schedule GCS-to-bronze Databricks run_now in BackgroundTasks after validated/
writes. Add correlation_id and JSON trace logs (validation_request, background
start/done). Optional DATABRICKS_VALIDATED_BRONZE_SYNC_JOB_ID; resolve job by
name with duplicate detection when unset. Refine skip reasons for PDP vs
Edvise/Legacy.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(data): align bronze sync with universal principles

Extract Databricks helpers and job-parameter constants, use specific
exceptions (ValueError, DatabricksError), and split background logging
into focused functions under 50 lines. Add tests for PDP-only and env
kill-switch skips plus run_now parameter contract coverage.

Co-authored-by: Cursor <cursoragent@cursor.com>

* style: apply ruff format to bronze sync modules

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(data): resolve prefixed bronze sync Databricks jobs

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(data): map bronze sync job ids by environment

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(data): trigger bronze sync during validation request

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(api): validate Edvise uploads with repo schemas (#242)

* feat(api): validate Edvise uploads with repo schemas

Route Edvise student and course uploads through upstream edvise Pandera schemas so upload validation matches the pipeline contract and is not bypassed by registry schema drift.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(api): separate Edvise validation routing

Keep PDP and Edvise repo-schema upload validation paths distinct so each helper has a single responsibility while preserving the same validation behavior.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(api): remove redundant repo validation fallback

Keep JSON validation flow focused now that PDP and Edvise repo-schema uploads are routed before schema merging.

Co-authored-by: Cursor <cursoragent@cursor.com>

* style(api): format validation routing test

Apply Ruff formatting to keep the Edvise validation routing tests passing style checks.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(webapp): establish pyproject.toml as canonical Edvise API version (#243)

* feat(webapp): establish pyproject.toml as canonical Edvise API version, set version and title in OpenAPI

* docs: rename SST -> Edvise

* docs(webapp): update formatter instructions to ruff to align with github workflows and engineering playbook

* test(webapp): assert OpenAPI version matches pyproject.toml

* feat(eda): add clear_cache option to /eda endpoint (#233)

* feat: legacy school inference DB job trigger (#212)

* feat: custom school inference, but need to confirm if custom is the same as legacy

* fix: transitioning from 'custom' to 'legacy'

* fix: remove validation of job parameters, handled already through edvise

* fix: run request still requires str values, defaulting to empty string

* fix: still getting pydantic error

* feat: using substring matching to find legacy job since i have it deployed under my name because of target==dev

* fix: style

* fix: style

* fix: style

* fix: making batch file name more robust so we don't run into decoding issues

* fix: merge conflict

* fix: merge conflict

---------

Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>

* feat: add gen ai as 4th option in addition pdp / edvise / legacy in api and uploads (#244)

* feat: Added "GenAI" as an option for "create institution"

note: for GenAI raw files, we will reuse the same loose rules as Legacy institutions (read CSV, PII check, no strict ES columns).

* fix: style

---------

Co-authored-by: Noreen Mayat <nm3224@alum.barnard.edu>
Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>

* fix(models): derive PDP batch schema configs from institution schemas (#247)

* fix(models): derive PDP batch schema configs from institution schemas

When model.schema_configs is null, PDP inference now builds a default
required COURSE+STUDENT (etc.) batch rule from inst.schemas instead of
500ing. Explicit model configs still take precedence.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(models): cast jsonpickle decode for mypy no-any-return

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(databricks): prefer Cloud Run job when pipeline name is ambiguous

When multiple dev bundle jobs match a PDP or legacy inference pipeline
substring, resolve to [dev dev_cloudrun_sa] if present; otherwise use
the first sorted match instead of failing the inference request.
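A sketch of that tie-break under stated assumptions: pick_pipeline_job and the sample job names are hypothetical, and the preferred prefix is taken literally from the commit message.

```python
PREFERRED_PREFIX = "[dev dev_cloudrun_sa]"


def pick_pipeline_job(job_names: list, pipeline_substring: str) -> str:
    """Choose one job when several deployed names contain the substring."""
    matches = sorted(name for name in job_names if pipeline_substring in name)
    if not matches:
        raise ValueError(f"no job matches {pipeline_substring!r}")
    # Prefer the Cloud Run service account's deployment when it exists.
    for name in matches:
        if name.startswith(PREFERRED_PREFIX):
            return name
    # Otherwise fall back to the first name in sorted order.
    return matches[0]


with_cloudrun = pick_pipeline_job(
    [
        "[dev bob] pdp_inference_pipeline",
        "[dev dev_cloudrun_sa] pdp_inference_pipeline",
        "[dev alice] pdp_inference_pipeline",
    ],
    "pdp_inference_pipeline",
)
without_cloudrun = pick_pipeline_job(
    ["[dev bob] pdp_inference_pipeline", "[dev alice] pdp_inference_pipeline"],
    "pdp_inference_pipeline",
)
```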

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

* ci: datakind shared workflows (#245)

* ci: datakind shared workflows

* refactor: rename test.yml -> tests.yml

* fix(ci): add workflow_call to style and tests workflows

* refactor: use pre-release workflow from shared workflows

* ci: replace with shared enforce-pr-targets workflow

Aligns checks against the current protected branches, main and develop, rather than staging

* chore: remove unused workflow

* refactor: remove pull_request triggers. These run via ci.yml

* ci: pin tests and type-check to Python 3.13

* chore(ci): remove legacy webapp-and-worker precommit workflow

* ci: standardize on Python 3.12 across workflows and pyproject

* ci: test workflow enforcement

* ci: test workflow enforcement

* ci: add gate job to report required ci status check

* chore: bump python version to 3.10

* chore: standardize Python 3.12 across project and Docker

* chore: updating edvise v1.2.0

* chore: CHANGELOG.md update + type check

* chore(release): bump version

* ci(cloudbuild): parameterize webapp deploy for multi-environment triggers

---------

Co-authored-by: Rachel Wells <rachellaurynwells@gmail.com>
Co-authored-by: Vishakh Pillai <64162993+vishpillai123@users.noreply.github.com>
Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>
Co-authored-by: Hannah Ofstedahl <98632391+chapmanhk@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Noreen Mayat <nm3224@alum.barnard.edu>
mrmaloof added a commit that referenced this pull request Jun 17, 2026
* Merge pull request #217 from datakind/fix/pdp-course-handling-duplicates

fix: fix duplicate-handling step in validation

* fix(storage): reduce peak memory during upload validation

- Download unvalidated blob to a temp file and validate by path instead of
  blob.open().read() via _path_for_edvise_read (avoids a full in-RAM copy).
- Write validated CSV to a temp file and upload_from_filename instead of
  building the entire CSV in a StringIO string.
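A minimal sketch of the download-to-temp-file pattern, assuming a blob object with a download_to_filename method (as google-cloud-storage blobs provide). FakeBlob, validate_blob_via_temp_file, and count_rows are hypothetical stand-ins, not the helpers in gcsutil.

```python
import csv
import os
import tempfile


class FakeBlob:
    """Hypothetical stand-in for a storage blob (assumption)."""

    def __init__(self, data: bytes) -> None:
        self._data = data

    def download_to_filename(self, filename: str) -> None:
        with open(filename, "wb") as fh:
            fh.write(self._data)


def validate_blob_via_temp_file(blob, validate_path):
    """Download to disk, validate by path, and always remove the temp file."""
    fd, local_csv_path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)  # the blob client reopens the path itself
    try:
        blob.download_to_filename(local_csv_path)
        return validate_path(local_csv_path)
    finally:
        if os.path.exists(local_csv_path):
            os.unlink(local_csv_path)


seen_paths = []


def count_rows(path: str) -> int:
    """Toy validator: count data rows while streaming from disk."""
    seen_paths.append(path)
    with open(path, newline="") as fh:
        return sum(1 for _ in csv.reader(fh)) - 1  # minus the header row


rows = validate_blob_via_temp_file(FakeBlob(b"id,name\n1,a\n2,b\n"), count_rows)
```

The file contents never have to sit in memory as one bytes or str object, which is the point of the change.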

Branched from develop (repo has no dev branch).

Made-with: Cursor

* chore(storage): log errno on temp download/to_csv OSError

Helps distinguish ENOSPC vs other failures in Cloud Run logs; re-raises unchanged.

Made-with: Cursor

* test(storage): cover temp cleanup and OSError logging for validate upload

- Download OSError: unlink temp, skip validate_file_reader, log errno
- to_csv OSError: unlink temp, no upload, log errno
- Upload failure after to_csv: temp still unlinked

Made-with: Cursor

* refactor(storage): extract temp download/unlink helpers for clarity

Aligns with universal-principles: keep _run_validation_and_get_normalized_df
under 50 lines, reduce nesting, replace tmp_path with local_csv_path naming.

Made-with: Cursor

* style: apply black/ruff format to gcsutil_test.py

Made-with: Cursor

* feat: consolidating staging into main and using main going forward as production (#234)

* Feat: Added backfill endpoint

* Fix: linting

* added func description

* added func description

* added func description

* added func description

* added func description

* added func description

* added func description

* feat: adjusted run output endpoint to return model_run_id

* Delete .DS_Store

* Delete src/.DS_Store

* Delete terraform/.DS_Store

* feat: added model deletion endpoint

* feat: added model deletion endpoint

* feat: added model deletion endpoint

* fix: linting

* fix: linting

* fix: linting

* fix: linting

* fix: linting

* fix: linting

* fix: linting

* fixed model name malformation

* fix: removed databricks deletion functionality

* fix: removed query results not needed

* fix: removed query results not needed

* fix: added status

* fix: added status

* fix: formatting fix

* fix: added query to retrieve model id

* fix: added passive delete to db cascade so deleting the model ensures job runs are deleted

* fix: removed extra db query for model id, since db now handles passive deletes

* fix: formatting fix

* fix: removed db mapping framework

* fix: removed db mapping framework

* fix: removed db mapping framework

* fix: removed db mapping framework

* feat: changed endpoint parameter name from experiment_run_id -> model_run_id

* fix: type check errors

* test batch and file data

* eda endpoints

* test data

* eda calculations

* eda year and term, course enrollments

* eda degree types

* fix: divide data category into a separate front end table section

* fix: linting

* feat: developed function for adding custom jobs with institution and model validation

* fix: linting errors

* fix: linting errors

* fix: changed route from GET to POST

* fix: added output filename definition

* fix: linting errors

* eda test institution data

* eda test institution

* eda data

* eda test data

* allow missing eda data

* eda enrollment type by intensity

* eda pell recipient by 1st gen

* eda student age by gender

* eda pell status by race

* eda tests

* cache eda

* tidy up

* remove LOCAL test bucket setup

* return List from get_term_counts

* import pandas

* remove unused variable

* tidy up

* eda bucket names

* fix: type check errors

* fix: type check errors

* fix: type check errors

* fix: formatting errors

* fix: type check errors

* fix: type check errors

* fix: batch name renewal

* fix: batch name renewal

* fix: changed output_valid to true

* fix: adjusted model card file path

* fix: ensuring we are grabbing the most recent run for a model id

* remove colors from /eda endpoint

* return count and percentage in /eda degree_types

* tidy up

* fix: fix file format

* fix: retrieve by model_run_id instead

* fix: formatting

* fix: validation error for worwic

* fix: changed model name to model_run_id parameter

* fix: added function to retrieve config.toml from select catalog

* manually initialized course mappings

* feat: added validation mapping

* fix: formatting

* fix: pylint

* Ignore .cursor folder for personal cursor preferences

* feat(schema): add Edvise schema definition

* feat(institutions): add Edvise schema support

Add Edvise schema support to institution management:

- Add edvise_id field to InstTable and SchemaRegistryTable

- Update create/update endpoints with Edvise support and validation

- Add mutual exclusivity check (PDP vs Edvise)

- Implement normalization for empty strings and whitespace

- Remove redundant boolean flags (derive status from ID presence)

- Add comprehensive test coverage (34 new test cases)

All changes are backward compatible.
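The normalization and mutual-exclusivity rules listed above might look roughly like this. The helper names and the error message are hypothetical, and the real endpoints presumably return an HTTP error rather than raising ValueError.

```python
from typing import Optional, Tuple


def normalize_id(value: Optional[str]) -> Optional[str]:
    """Treat None, empty strings, and whitespace-only strings as unset."""
    if value is None:
        return None
    stripped = value.strip()
    return stripped or None


def check_schema_ids(
    pdp_id: Optional[str], edvise_id: Optional[str]
) -> Tuple[Optional[str], Optional[str]]:
    """Normalize both ids and reject institutions that set both."""
    pdp, edvise = normalize_id(pdp_id), normalize_id(edvise_id)
    if pdp and edvise:
        raise ValueError("pdp_id and edvise_id are mutually exclusive")
    return pdp, edvise


normalized = check_schema_ids("   ", " E1 ")
```

Normalizing before the check means a whitespace-only pdp_id does not falsely trip the exclusivity rule.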

* fix: resolve CI/CD test failures

- Fix test_create_inst_with_edvise_success: use unique institution name to avoid UNIQUE constraint violation
- Fix test_trigger_inference_run: add pdp_id to InstTable fixture in models_test.py
- Fix code formatting: run ruff format on database.py, institutions.py, and institutions_test.py

These fixes address the three issues that were causing CI/CD test failures:
1. UNIQUE constraint failed: inst.name in test_create_inst_with_edvise_success
2. Assertion error: expected 400 but got 501 in test_trigger_inference_run
3. Ruff format check failures

* fix: resolve unique constraint conflicts in SchemaRegistryTable

- Add doc_type to is_pdp and is_edvise unique constraints to allow base, PDP, and Edvise schemas to coexist with same version
- Add CheckConstraint to enforce mutual exclusivity of is_pdp and is_edvise flags

Fixes Bugbot issue: Unique constraint prevented coexisting schema types for same version. The original constraints (is_pdp, version_label) and (is_edvise, version_label) prevented base schema and PDP/Edvise extensions from sharing the same version label since they all had is_pdp=False and is_edvise=False. Adding doc_type to these constraints allows proper coexistence while maintaining uniqueness guarantees.

Also adds database-level enforcement that is_pdp and is_edvise cannot both be True simultaneously.

* fix: resolve mypy type errors

- Fix type error in institutions.py: change set to list for requested_schemas default value
- Add return type annotations to all test functions in institutions_test.py
- Add return type annotations to fixture functions
- Add typing.Any import for fixture return types

Fixes mypy errors: incompatible types in assignment and missing return type annotations.

* fix: add missing type annotations to test function parameters

- Add TestClient type annotations to test_create_inst_unauth, test_create_inst, test_edit_inst, and test_delete_inst

Fixes mypy errors: Function is missing a type annotation for one or more arguments.

* feat: Implement Phase 3 Edvise schema validation logic

- Add EDVISE_SCHEMA_GROUP constant to utilities.py (mirrors PDP_SCHEMA_GROUP)
- Add _edvise_cache to _ValidationState class for schema caching with TTL
- Update validation_helper() to load Edvise schema when edvise_id is set
- Add defensive check for mutual exclusivity (pdp_id and edvise_id cannot both be set)
- Add error handling for missing Edvise schema with clear error messages
- Update institution creation endpoint to use EDVISE_SCHEMA_GROUP when edvise_id is provided
- Add comprehensive test suite: 15 tests covering happy path, errors, cache, authorization, and edge cases

This implementation enables institutions with edvise_id to use the Edvise schema extension
for file validation, following the same pattern as PDP schema validation. All changes are
backwards compatible and include comprehensive test coverage (~90% of critical paths).

* fix: Resolve Edvise test failures and improve test reliability

- Fix type annotation error in PDP schema branch (mypy no-redef)
- Change test user to DATAKINDER for multi-institution access
- Fix database constraint violation in precedence test (version_label)
- Simplify cache tests to verify behavior instead of implementation
- Remove duplicate assertion in cache expiration test
- Optimize imports in test fixture

* fix: Update Edvise test filenames to include descriptive keywords

- Change generic test filenames (test.csv, test_file.csv, etc.) to include 'student' keyword
- This allows validation_helper to properly infer model types from filenames
- Fixes ValueError: Could not infer model(s) from file name errors
- Formatting will be applied by CI ruff formatter

* style: Format data_test.py with ruff

* fix(validation): return proper HTTP status codes for institution errors

- Change ValueError to HTTPException (404) when institution not found in validation_helper
- Fix test_validate_edvise_unauthorized to test actual unauthorized access instead of non-existent institution
- Ensures proper HTTP status codes are returned to API clients

* fix: handle filename inference errors and extension schema deactivation

- Replace ValueError with HTTPException (422) for filename inference failures
  to return proper user-facing error instead of 500
- Deactivate existing extension schemas before inserting new ones to ensure
  only one active extension per institution and prevent nondeterministic queries
- Add comprehensive validation error formatter with PII masking and user-friendly messages
- Add integration and snapshot tests for error formatter

* fix: remove unused imports from validation_error_formatter_snapshot_test

- Remove unused typing imports (Any, Dict, List)
- Remove unused pandera imports (DataFrameSchema, Column, Check)
- Remove unused MAX_ERROR_EXAMPLES import

Fixes ruff linting errors (F401) reported in CI.

* fix: resolve test failures and configuration issues

- Remove invalid catalog_name parameter from create_custom_schema_extension call
- Restore testpaths configuration to use src directory
- Add Pandera FutureWarning filter to pytest config
- Fix syntax warning in databricks.py docstring
- Format files with Ruff

* fix: resolve Ruff and Mypy linting errors

- Remove unused imports (IO, cast, tomli/tomllib) from databricks.py
- Remove duplicate import re statement
- Add type annotations to test cases in validation_error_formatter_test.py
- Add type: ignore comments for intentional invalid type tests

* fix: align database constraints with production schema and fix Edvise version_label collision

- Fix uq_pdp_version constraint to match production: remove doc_type (matches actual DB schema)
- Remove uq_edvise_version constraint (enforced operationally, not via DB constraint)
- Update CHECK constraint to use MySQL-compatible boolean values (1/0 instead of TRUE/FALSE)
- Fix Edvise test fixture to use version_label='edvise-1.0.0' to avoid uq_pdp_version collision
- Add explanatory comment about version_label choice in test fixture

These changes ensure the ORM matches the actual production database schema and prevent
constraint violations when running tests against MySQL.

* fix: handle parameterized Pandera check types in validation error formatting

Fix bug where parameterized check types (e.g., "isin(['A', 'B', 'C'])",
"str_length(3, None)") were not being matched to their formatters, causing
generic error messages instead of human-readable ones.

Changes:
- Add _extract_base_check_type() to extract base type from parameterized
  check types (e.g., "isin(['A', 'B'])" -> "isin")
- Add _normalize_check_type_alias() to map verbose Pandera names to spec
  keys (e.g., "greater_than" -> "gt", "greater_than_or_equal_to" -> "ge")
- Update _find_check_spec() to use base type extraction and alias normalization
- Update _format_check_error() to only format when matching spec is found
  (prevents semantic errors like formatting "greater_than" as "ge")
- Add _format_gt_error() and _format_lt_error() for strict comparison checks
- Preserve semantic correctness: strict comparisons (> and <) vs non-strict (≥ and ≤)

Edge cases handled:
- Namespaced types: "Check.isin(['A'])" -> "isin"
- Empty/None/non-string inputs: returns safe empty string
- Spaces around parentheses: "isin (['A'])" -> "isin"
- Complex repr: "str_matches(re.compile('...'))" -> "str_matches"
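A sketch of the base-type extraction and alias mapping that reproduces the listed edge cases. The function bodies are an illustration, not the shipped _extract_base_check_type, and the alias table contains only the two mappings named in this message.

```python
# Only the aliases named in the commit message; the real table may be larger.
_ALIASES = {"greater_than": "gt", "greater_than_or_equal_to": "ge"}


def extract_base_check_type(check) -> str:
    """Return the bare check name from a possibly parameterized string."""
    if not isinstance(check, str) or not check.strip():
        return ""  # safe default for empty, None, or non-string input
    # Drop the argument list, tolerating a space before the parenthesis.
    head = check.strip().split("(", 1)[0].strip()
    # Drop any namespace such as "Check.".
    return head.rsplit(".", 1)[-1]


def normalize_check_type_alias(base: str) -> str:
    """Map verbose names to spec keys; unknown names pass through."""
    return _ALIASES.get(base, base)


examples = [
    extract_base_check_type("isin(['A', 'B'])"),
    extract_base_check_type("Check.isin(['A'])"),
    extract_base_check_type("isin (['A'])"),
    extract_base_check_type("str_matches(re.compile('...'))"),
    extract_base_check_type(None),
]
```

Keeping greater_than mapped to gt and not ge preserves the strict versus non-strict distinction called out above.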

Testing:
- Add comprehensive unit tests for base type extraction and alias handling
- Add tests for parameterized check types (isin, str_length, gt, ge)
- Update integration test assertion to match actual output format
- Update snapshot fixtures to reflect new human-readable messages

Fixes parameterized check type matching while maintaining semantic
correctness for strict vs non-strict comparisons.

* style: format validation_error_formatter files with ruff

Auto-formatted files to comply with project formatting standards.

* feat: add case-insensitive institution name lookup

- Implement case-insensitive matching for GET /institutions/name/{inst_name} endpoint
- Use func.lower() on both database column and input parameter for case-insensitive comparison
- Update docstring to document case-insensitive behavior and error handling
- Add comprehensive test cases for case-insensitive matching:
  - Test multiple case variations (original, title case, uppercase, mixed case)
  - Test lowercase input matching database entries
  - Test uppercase input matching lowercase database entries
- Fix type error: change requested_schemas assignment from set to list for type consistency
- Apply code formatting with ruff

* fix: add missing return type annotations to test functions

- Add Generator import from typing for fixture return types
- Add return type annotations (-> None) to all test functions:
  - test_read_all_inst
  - test_read_all_inst_datakinder
  - test_read_inst_by_name
  - test_read_inst_by_name_case_insensitive
  - test_read_inst_by_name_case_insensitive_lowercase
  - test_read_inst_by_name_case_insensitive_uppercase
  - test_read_inst_by_pdp_id
  - test_read_inst
- Fix fixture return types to use Generator[TestClient, None, None]
  - client_fixture
  - datakinder_client_fixture
- Resolves mypy type checking errors for test file

* style: apply ruff formatting to test file

- Split long function signatures across multiple lines for readability
- Format client_fixture and datakinder_client_fixture function signatures
- Format test_read_inst_by_name_case_insensitive_lowercase and _uppercase function signatures

* fix(test): update institutions test for edvise_id API changes

- Remove unused typing.Any import
- Update test_read_all_inst_datakinder to include edvise_id in expected response
- Add edvise_test_school institution to expected response (4 institutions total)
- Fix line length for pylint compliance

This fixes test failures caused by API changes from develop branch that now
return edvise_id and pdp_id fields for all institutions.

* fix(validation): pass institution_id so Edvise/PDP/custom use correct extension block

- Thread schema_namespace (edvise | pdp | inst UUID) from data router through
  validate_file and validate_file_reader into validate_dataset
- merge_model_columns now receives correct key for extension_schema['institutions']
- Add institution_id param with default 'pdp' for backward compatibility
- Add tests: assert Edvise validation passes institution_id='edvise'; add unit test
  that institution_id selects the right extension block (edvise vs pdp)
- Expand docstrings (Args/Returns) and add comment explaining schema_namespace
- Addresses reviewer Q1: schema extension logic now works for Edvise and custom
  institutions, not only PDP

* Apply Black formatting to institutions_test.py

* Apply ruff format to institutions_test.py

* Fix institutions_test assert for Black and Ruff format compatibility

* Fix pylint E1135 in data_test: use .get() instead of membership test on captured_schema

* Apply ruff format to data_test.py

* feat(validation): schema validation during upload with PDP/edvise repo alignment

- Add PDP edvise schema validation path (validation_pdp_edvise)
- Add Edvise-to-PDP normalization (validation_edvise_normalize)
- Integrate repo schemas into validation pipeline and error formatter
- Update pdp_schema_extension and lockfile; add tests

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(validation): write normalized data to validated/, archive raw to raw/

- On validation success: archive original to raw/{filename}, write
  normalized (canonical columns, coerced dtypes) DataFrame to
  validated/{filename}, delete from unvalidated/
- Validation layer always returns normalized_df on success; storage
  serializes to UTF-8 CSV and uploads to validated/
- Add input validation and helpers in gcsutil (under 50 lines); catch
  specific exceptions; TYPE_CHECKING for HardValidationError in
  validation_pdp_edvise
- Add gcsutil_test.py: validate_file input/error/success paths,
  _run_validation_and_get_normalized_df, _write_dataframe_to_gcs_as_csv
- Add validation_test: empty-schema short-circuit returns normalized_df None
- Ruff/black formatting and lint fixes; mypy-clean for touched files
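
The success path above (archive to `raw/`, write normalized output to `validated/`, delete from `unvalidated/`) can be sketched as follows. A plain dict stands in for the GCS bucket, and `promote_validated_file` is a hypothetical name used only for illustration:

```python
def promote_validated_file(
    bucket: dict[str, bytes], filename: str, normalized_csv: str
) -> None:
    """Move a validated upload through the three prefixes.

    `bucket` is an in-memory stand-in for object storage, mapping
    object path to bytes.
    """
    source_path = f"unvalidated/{filename}"
    original = bucket[source_path]  # KeyError if the upload is missing
    # Archive the untouched original first so it is never lost.
    bucket[f"raw/{filename}"] = original
    # Store the normalized data as UTF-8 CSV.
    bucket[f"validated/{filename}"] = normalized_csv.encode("utf-8")
    # Only then remove the unvalidated object.
    del bucket[source_path]
```

Ordering the delete last means a failure partway through leaves the original object in place.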

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(validation): align with universal principles, add tests, fix types and format

- Extract validation helpers to meet 50-line rule (_header_missing_and_extra,
  _get_csv_read_kwargs, _validate_optional_columns_json)
- Extract gcsutil._archive_raw_and_write_validated; add type hints to rename_file
- Add tests: PDP rename/validate_dataframe, CSV read failure, gcsutil error
  propagation, edvise institution_identifier in validate_file call
- Remove unused validation_edvise_normalize and its tests
- Fix mypy in validation_pdp_edvise and tests (Optional[List], cast, annotations)
- Apply ruff format

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(validation): use edvise read for PDP uploads and add PDP path tests

- Route PDP cohort/course through edvise read (read_raw_pdp_*); remove
  API-side normalizers for PDP so pipeline and API share one source of truth
- Add _path_for_edvise_read, _read_pdp_course_edvise, _validate_pdp_with_edvise_read
- Convert Pandera SchemaErrors to HardValidationError in PDP path
- Add validation_pdp_read_path_test.py (routing, path cleanup, SchemaErrors,
  course converter fallback); extend Src type with io.StringIO for file-like

Co-authored-by: Cursor <cursoragent@cursor.com>

* move cloud build config to repo

* sst-app-api -> edvise-api

* quiet down sqlalchemy

* use EdaSummary from edvise

* use ruff formatter

* test a file

* tidy up

* Add return type annotations for mypy in main_test and users_test

* tidy up

* move cache check after batch result check

* fix test_execute_pdp_pull

* install git

* install git in correct Dockerfile

* install git in worker

* update edvise branch

* use develop branch for edvise

* install edvise in build

* cloudbuild with edvise

* fix(validation): resolve pylint used-before-assignment error

Initialize schema_err_to_raise before try block to satisfy pylint's
static analysis, which doesn't recognize that pytest.skip() always raises.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(api): add legacy school type with any-format uploads

- Add legacy_id to InstTable and institution API (create, update, read)
- Enforce mutual exclusivity of pdp_id, edvise_id, legacy_id via has_at_most_one_school_type
- Legacy validation: encoding + CSV read only, no schema checks
- Add LEGACY_SCHEMA_GROUP and tests for legacy path and mutual exclusivity
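
One plausible shape for the mutual-exclusivity check named above. The function name comes from the commit message; the signature and the string-typed ids are assumptions:

```python
from typing import Optional


def has_at_most_one_school_type(
    pdp_id: Optional[str], edvise_id: Optional[str], legacy_id: Optional[str]
) -> bool:
    """True when zero or one of the three identifiers is set (non-None)."""
    return sum(v is not None for v in (pdp_id, edvise_id, legacy_id)) <= 1
```

For example, `has_at_most_one_school_type("p1", "e1", None)` is false because two school types are set at once.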

Made-with: Cursor

* feat(api): legacy PII check, principles compliance, and test coverage

- Add PII column check for legacy uploads; reject before raw/validated
- Treat student_id as non-PII (false positive) for all institution types
- Comply with universal principles: docstrings, extract create_institution
  helpers (<50 lines), comment lazy import in validation
- Add tests: has_at_most_one_school_type, legacy header-only CSV,
  legacy PII rejection returns 400, explicit legacy_id create,
  update add legacy_id, storage/Databricks failure paths
- Fix mypy in create_institution (row variable)
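
A rough sketch of a header-based PII check with a `student_id` exemption. The pattern list, the allowlist constant, and `find_pii_columns` are all hypothetical; the real check may use different rules:

```python
import re

# Hypothetical keyword pattern for illustration only.
PII_PATTERN = re.compile(r"ssn|email|phone|birth|first_name|last_name|_id$")
# student_id would match "_id$" above, so it is exempted explicitly.
NON_PII_ALLOWLIST = {"student_id"}


def find_pii_columns(columns: list[str]) -> list[str]:
    """Return the column names that look like PII, in input order."""
    flagged = []
    for col in columns:
        normalized = col.strip().lower().replace(" ", "_")
        if normalized in NON_PII_ALLOWLIST:
            continue  # known false positive
        if PII_PATTERN.search(normalized):
            flagged.append(col)
    return flagged
```

Because only column names are inspected, a check of this kind also works on header-only CSVs.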

Made-with: Cursor

* docs(api): use Edvise Schema (ES) naming to reduce confusion

Replace 'Edvise schema' with 'Edvise Schema (ES)' in docstrings,
comments, and user-facing error messages so the schema type is
distinguished from the Edvise product (ES convention).

Made-with: Cursor

* feat(data): allow legacy institutions to upload files with any filename

- Fetch institution before filename inference; set allowed_schemas to UNKNOWN when
  inference fails for legacy (non-legacy still get 422 for non-descriptive names)
- Refactor validation_helper into helpers under 50 lines; add full docstrings,
  early empty-filename and invalid inst_id validation, log before 404
- Add unit tests for _infer_allowed_schemas_from_filename and _ext_models_set
- Add integration tests: empty filename 422, invalid inst_id 404, edvise
  non-descriptive filename 422, duplicate validate idempotent
- Fix mypy and ruff/black in data.py and data_test.py
- Add PR_DESCRIPTION.md for feature branch

Made-with: Cursor

* chore: remove PR_DESCRIPTION.md

Made-with: Cursor

* fix(validation): run PII check for header-only legacy CSVs

* fix(test): align validation error snapshot with non-PII student_id display

Made-with: Cursor

* feat(validation): use PDP cohort converter and support custom converters

- Use converter_func_cohort by default for PDP cohort validation (filters DE/DS/SE)
- Add optional pdp_cohort_converter_func and pdp_course_converter_func to
  validate_file_reader and validate_dataset for school-specific overrides
- Course validation tries custom converter first, then default handling_duplicates
- Validate converter args are callable; convert converter/read failures to
  HardValidationError so API returns 400 with context
- Add PDPConverterFunc type; extract helpers to meet 50-line and error-handling rules

Made-with: Cursor

* fix(validation): satisfy mypy for PDP validation and tests

- Add unreachable return after with block in _validate_pdp_with_edvise_read
- Use cast(Any, ...) in tests that pass non-callables to converter params

Made-with: Cursor

* chore: remove real institution names

* chore: ruff format

* fix: use latest edvise EdaSummary

* fix: use edvise develop branch

* chore(deps): pin edvise to develop

* feat(ci): notify slack channel on deployment

* fix: lock file was out of sync

* chore: bump edvise version to 0.1.12

* Revert "feat: legacy school type with any-format uploads, PII check, and Edvise Schema (ES) naming"

* Revert "Revert "feat: legacy school type with any-format uploads, PII check, and Edvise Schema (ES) naming""

* feat(config): add optional local inst/batch/file seed from config for LOCAL

* style: ruff format

* fix(validation): pass schema_type to handling_duplicates for PDP course CSV

read_raw_pdp_course_data calls converter_func(df) with one argument; bare
handling_duplicates is invalid on current edvise. Use a wrapper that calls
handling_duplicates(df, "pdp") positionally for edvise compatibility.

Remove the broken second default converter. Update PDP read path test.
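
The wrapper idea can be illustrated with stand-ins. Here `handling_duplicates` and `read_course_data` are simplified local substitutes (not the edvise implementations), and rows are plain dicts rather than a DataFrame:

```python
from typing import Callable

Rows = list[dict[str, str]]


def handling_duplicates(rows: Rows, schema_type: str) -> Rows:
    """Stand-in for the edvise function: drops exact duplicate rows."""
    if schema_type != "pdp":
        raise ValueError(f"unsupported schema_type: {schema_type}")
    seen: set[tuple[tuple[str, str], ...]] = set()
    unique: Rows = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def pdp_course_converter(rows: Rows) -> Rows:
    """One-argument wrapper matching the converter_func(df) call shape."""
    return handling_duplicates(rows, "pdp")


def read_course_data(rows: Rows, converter_func: Callable[[Rows], Rows]) -> Rows:
    """Stand-in reader that calls the converter with exactly one argument."""
    return converter_func(rows)
```

Passing the bare two-argument function to the reader raises `TypeError`, which is the failure mode the wrapper avoids.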

Made-with: Cursor

* style: ruff format PDP course read path test

Made-with: Cursor

* fix(deps): upgrade databricks-sql-connector for pyarrow>=17 (edvise)

databricks-sql-connector 3.5 pins pyarrow<17; edvise requires pyarrow>=17.
Use databricks-sql-connector[pyarrow]~=4.2.x and refresh uv.lock (pyarrow 19).

Aligns lock with Cloud Build 'uv lock --upgrade-package edvise'.

Made-with: Cursor

* Merge pull request #217 from datakind/fix/pdp-course-handling-duplicates

fix: fix duplicate-handling step in validation

* fix(storage): reduce peak memory during upload validation

- Download unvalidated blob to a temp file and validate by path instead of
  blob.open().read() via _path_for_edvise_read (avoids a full in-RAM copy).
- Write validated CSV to a temp file and upload_from_filename instead of
  building the entire CSV in a StringIO string.
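
A sketch of the download-to-temp-file pattern, assuming a blob object that exposes `download_to_filename` (as google-cloud-storage blobs do). `FakeBlob`, `validate_via_temp_file`, and `count_data_rows` are hypothetical names for illustration:

```python
import os
import tempfile
from typing import Callable


class FakeBlob:
    """Stand-in for a storage blob; only download_to_filename is modeled."""

    def __init__(self, data: bytes) -> None:
        self._data = data

    def download_to_filename(self, filename: str) -> None:
        with open(filename, "wb") as fh:
            fh.write(self._data)


def validate_via_temp_file(blob: FakeBlob, validate_path: Callable[[str], int]) -> int:
    """Download to disk, validate by path, and always remove the temp file."""
    fd, local_csv_path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)  # the blob client reopens the path itself
    try:
        blob.download_to_filename(local_csv_path)
        return validate_path(local_csv_path)
    finally:
        if os.path.exists(local_csv_path):
            os.unlink(local_csv_path)


def count_data_rows(path: str) -> int:
    """Toy validator: stream the file and count rows after the header."""
    with open(path, newline="", encoding="utf-8") as fh:
        return max(sum(1 for _ in fh) - 1, 0)
```

The validator streams from disk, so the full file contents never need to sit in memory as a single string.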

Branched from develop (repo has no dev branch).

Made-with: Cursor

* chore(storage): log errno on temp download/to_csv OSError

Helps distinguish ENOSPC vs other failures in Cloud Run logs; re-raises unchanged.
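
Roughly, the logging pattern looks like this. The function name and log message are illustrative assumptions; the relevant part is logging `exc.errno` and then re-raising the same exception:

```python
import errno
import logging

logger = logging.getLogger(__name__)


def write_with_errno_logging(path: str, data: str) -> None:
    """Write text to path; on OSError, log the errno and re-raise unchanged."""
    try:
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(data)
    except OSError as exc:
        # errno.errorcode maps numbers to names such as "ENOSPC" or "ENOENT".
        name = errno.errorcode.get(exc.errno, "UNKNOWN") if exc.errno is not None else "UNKNOWN"
        logger.error("temp write failed: errno=%s (%s) path=%s", exc.errno, name, path)
        raise
```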

Made-with: Cursor

* test(storage): cover temp cleanup and OSError logging for validate upload

- Download OSError: unlink temp, skip validate_file_reader, log errno
- to_csv OSError: unlink temp, no upload, log errno
- Upload failure after to_csv: temp still unlinked

Made-with: Cursor

* refactor(storage): extract temp download/unlink helpers for clarity

Aligns with universal-principles: keep _run_validation_and_get_normalized_df
under 50 lines, reduce nesting, replace tmp_path with local_csv_path naming.

Made-with: Cursor

* style: apply black/ruff format to gcsutil_test.py

Made-with: Cursor

* chore: bump edvise v0.2.0

* fix(pdp-validation): default cohort converter to none

Stop passing edvise converter_func_cohort when pdp_cohort_converter_func is omitted so PDP cohort rows are validated as read.

- Callers may still pass an explicit cohort converter.

- Update PDP read-path test to expect converter_func=None.

- Refresh docstrings (pipeline vs API, Args/Returns/Raises) in validation and validation_pdp_edvise.

Made-with: Cursor

* feat(api): remove custom institution path; require school type; legacy schemas UNKNOWN

- Require exactly one of PDP, Edvise, or Legacy on POST /institutions
- Remove custom schema resolution and Databricks extension generation for uploads
- Fix PATCH /institutions to persist allowed_schemas to inst.schemas column
- LEGACY_SCHEMA_GROUP stores UNKNOWN only; drop validation_extension module
- Update tests and default fixtures for typeless/custom removal

Made-with: Cursor

* feat(api): harden institutions API after custom-institution removal

- POST/PATCH: require exactly one school type (pdp, edvise, or legacy)
- PATCH: recompute schemas only when the type triple changes; merge optional allowed_schemas on change
- PATCH: honor is_edvise/is_legacy for auto-assigned ids (POST parity)
- Docs/tests: validation namespaces; disambiguate custom naming in code and tests

Made-with: Cursor

* docs(api): revert broad custom wording; keep upload docs accurate

Restore original docstrings and test names where "custom" referred to
converters, schema config, or JSON keys, not custom institutions.

Keep gcsutil validate_file institution_id line aligned with pdp/edvise/legacy
only (no institution-UUID-for-custom upload path).

Made-with: Cursor

* fix(institutions): reject POST duplicate when existing row lacks school type

When (name, state) matches an existing InstTable row, validate stored
pdp_id/edvise_id/legacy_id the same as new creates: at most one non-null
and exactly one required. Return 400 with guidance instead of 200 for
typeless or invalid rows. Add regression tests.

Made-with: Cursor

* test(institutions): cover duplicate POST, PATCH flags, allowed_schemas-only

- Reject is_pdp without pdp_id on POST
- Reject duplicate (name, state) when stored row has conflicting ids
- Reject PATCH is_edvise on PDP row without clearing pdp_id
- Reject PATCH with both is_edvise and is_legacy
- allowed_schemas-only PATCH replaces schemas when type unchanged

Made-with: Cursor

* refactor(institutions): extract PATCH helpers and DRY school-type errors

- Add shared mutual-exclusion detail constant for POST/PATCH paths
- Extract duplicate-post row validation and PATCH merge/validate/persist helpers
- Keep update_inst within single-responsibility helpers; reuse row response mapper

Made-with: Cursor

* fix(lint): satisfy ruff and mypy on databricks and institutions

- Remove unused HTTPException import from databricks.py (F401)
- Cast ORM row in _require_single_institution_row_by_uuid for InstTable (no-any-return)

Made-with: Cursor

* style(institutions): apply ruff format to router and tests

Made-with: Cursor

* refactor: simplify local_inst_data

* docs: Update local_inst_data instructions

* chore: remove unused import

* fix: make pdp_id and state optional

* chore: bumping pyproject and uv.lock

---------

Co-authored-by: Mesh <meshach.ogunmodede@datakind.org>
Co-authored-by: Meshach Ogunmodede <142531479+Mesh-ach@users.noreply.github.com>
Co-authored-by: William Carr <bill.carr@datakind.org>
Co-authored-by: Hannah Ofstedahl <98632391+chapmanhk@users.noreply.github.com>
Co-authored-by: Hannah Ofstedahl <hannahxchapman@yahoo.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: William Carr <bill@datakind.org>
Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>
Co-authored-by: kaylawilding <95330483+kaylawilding@users.noreply.github.com>

* Revert "feat: consolidating staging into main and using main going forward as…" (#236)

This reverts commit 9b70f23.

* Merge develop into main (#240)

* docs: inherit org community health files (#237)

* docs: remove local community health files to inherit from org-wide .github repo

* docs: update README to include previous contributing info

* feat(api): simplify create model request to name only (#238)

* chore: bump edvise dependency to 1.0.0 (#241)

Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>

* chore: creating dummy changelog.md file while we create semver / gitflow process

---------

Co-authored-by: Rachel Wells <rachellaurynwells@gmail.com>
Co-authored-by: William Carr <bill@datakind.org>
Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>

* chore(release): edvise-api 1.0.0 (#249)

* docs: inherit org community health files (#237)

* docs: remove local community health files to inherit from org-wide .github repo

* docs: update README to include previous contributing info

* feat(api): simplify create model request to name only (#238)

* chore: bump edvise dependency to 1.0.0 (#241)

Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>

* chore: creating dummy changelog.md file while we create semver / gitflow process

* feat: trigger Databricks GCS→bronze sync after file validation (Edvise/Legacy) (#239)

* Trigger GCS→bronze Databricks sync after validation (Edvise/Legacy)

- Add run_validated_gcs_to_bronze_sync and job edvise_validated_gcs_to_bronze_sync
  with include_blob_paths_json for validated/{file_name}.
- Call after successful validate-upload / validate-sftp when edvise_id or legacy_id
  is set; ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION (default true) to disable.
- Failures to start the job are logged and do not fail validation.
- Extend data tests with DatabricksControl mock and assertions.
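
A hedged sketch of the kill switch and the non-fatal trigger. The env var name and the `success | trigger_failed | skipped` outcomes come from this PR; the false-like value parsing, the helper names, and the `trigger` callable are assumptions:

```python
import logging
import os
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def bronze_sync_enabled() -> bool:
    """Env kill switch; enabled unless set to a false-like value (assumed parsing)."""
    raw = os.environ.get("ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION", "true")
    return raw.strip().lower() not in {"0", "false", "no", "off"}


def maybe_trigger_bronze_sync(
    edvise_id: Optional[str],
    legacy_id: Optional[str],
    file_name: str,
    trigger: Callable[[str], int],
) -> str:
    """Return "skipped", "success", or "trigger_failed"; never raises."""
    if not bronze_sync_enabled() or not (edvise_id or legacy_id):
        return "skipped"  # disabled, or a PDP-only institution
    try:
        run_id = trigger(f"validated/{file_name}")
    except Exception:  # broad on purpose in this sketch; validation must not fail
        logger.exception("bronze sync trigger failed for %s", file_name)
        return "trigger_failed"
    logger.info("bronze sync run accepted: run_id=%s", run_id)
    return "success"
```

The caller can record the returned outcome without changing the validation response.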

Made-with: Cursor

* feat(data): bronze sync after validation with tracing and job id resolution

Schedule GCS-to-bronze Databricks run_now in BackgroundTasks after validated/
writes. Add correlation_id and JSON trace logs (validation_request, background
start/done). Optional DATABRICKS_VALIDATED_BRONZE_SYNC_JOB_ID; resolve job by
name with duplicate detection when unset. Refine skip reasons for PDP vs
Edvise/Legacy.
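
The trace logging could look roughly like this. The event names and the `correlation_id` field come from the PR description; `trace_event`, `new_correlation_id`, and the logger name are hypothetical:

```python
import json
import logging
import uuid
from typing import Any

logger = logging.getLogger("bronze_sync_trace")


def new_correlation_id() -> str:
    """Generate an id that ties the request log to the later sync logs."""
    return uuid.uuid4().hex


def trace_event(event: str, correlation_id: str, **fields: Any) -> str:
    """Log one JSON line and return it (returned to make testing easy)."""
    line = json.dumps(
        {"event": event, "correlation_id": correlation_id, **fields},
        sort_keys=True,
    )
    logger.info(line)
    return line
```

Emitting `validation_request`, `gcs_bronze_sync_background_start`, and `gcs_bronze_sync_background_done` with the same id lets one log query follow a single upload end to end.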

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(data): align bronze sync with universal principles

Extract Databricks helpers and job-parameter constants, use specific
exceptions (ValueError, DatabricksError), and split background logging
into focused functions under 50 lines. Add tests for PDP-only and env
kill-switch skips plus run_now parameter contract coverage.

Co-authored-by: Cursor <cursoragent@cursor.com>

* style: apply ruff format to bronze sync modules

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(data): resolve prefixed bronze sync Databricks jobs
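
A simplified sketch of the resolution order (explicit id, exact name, then a unique bundle-prefixed name). It omits the DEV/STAGING id mapping, takes a plain `{job_id: name}` dict instead of calling Databricks, and treats "prefixed" as "name ends with the job name", which is an assumption:

```python
from typing import Optional

JOB_NAME = "edvise_validated_gcs_to_bronze_sync"


def resolve_job_id(jobs: dict[int, str], configured_job_id: Optional[int] = None) -> int:
    """Pick the sync job id, raising ValueError when missing or ambiguous."""
    if configured_job_id is not None:
        return configured_job_id  # explicit override wins
    exact = [job_id for job_id, name in jobs.items() if name == JOB_NAME]
    if len(exact) == 1:
        return exact[0]
    if len(exact) > 1:
        raise ValueError(f"multiple jobs named {JOB_NAME}: {sorted(exact)}")
    # Bundle deployments may prefix the name, e.g. "[dev someone] <job name>".
    prefixed = [job_id for job_id, name in jobs.items() if name.endswith(JOB_NAME)]
    if len(prefixed) == 1:
        return prefixed[0]
    raise ValueError(f"expected one job matching {JOB_NAME}, found {len(prefixed)}")
```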

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(data): map bronze sync job ids by environment

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(data): trigger bronze sync during validation request

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(api): validate Edvise uploads with repo schemas (#242)

* feat(api): validate Edvise uploads with repo schemas

Route Edvise student and course uploads through upstream edvise Pandera schemas so upload validation matches the pipeline contract and is not bypassed by registry schema drift.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(api): separate Edvise validation routing

Keep PDP and Edvise repo-schema upload validation paths distinct so each helper has a single responsibility while preserving the same validation behavior.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(api): remove redundant repo validation fallback

Keep JSON validation flow focused now that PDP and Edvise repo-schema uploads are routed before schema merging.

Co-authored-by: Cursor <cursoragent@cursor.com>

* style(api): format validation routing test

Apply Ruff formatting to keep the Edvise validation routing tests passing style checks.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(webapp): establish pyproject.toml as canonical Edvise API version (#243)

* feat(webapp): establish pyproject.toml as canonical Edvise API version, set version and title in OpenAPI

* docs: rename SST -> Edvise

* docs(webapp): update formatter instructions to ruff to align with github workflows and engineering playbook

* test(webapp): assert OpenAPI version matches pyproject.toml

* feat(eda): add clear_cache option to /eda endpoint (#233)

* feat: legacy school inference DB job trigger (#212)

* feat: custom school inference, but need to confirm if custom is the same as legacy

* fix: transitioning from 'custom' to 'legacy'

* fix: remove validation of job parameters, handled already through edvise

* fix: run request still requires str values, defaulting to empty string

* fix: still getting pydantic error

* feat: using substring matching to find legacy job since I have it deployed under my name because of target==dev

* fix: style

* fix: style

* fix: style

* fix: making batch file name more robust so we don't run into decoding issues

* fix: merge conflict

* fix: merge conflict

---------

Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>

* feat: add gen ai as 4th option in addition to pdp / edvise / legacy in api and uploads (#244)

* feat: Added "GenAI" as an option for "create institution"

note: for GenAI raw files, we will reuse the same loose rules as Legacy institutions (read CSV, PII check, no strict ES columns).

* fix: style

---------

Co-authored-by: Noreen Mayat <nm3224@alum.barnard.edu>
Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>

* fix(models): derive PDP batch schema configs from institution schemas (#247)

* fix(models): derive PDP batch schema configs from institution schemas

When model.schema_configs is null, PDP inference now builds a default
required COURSE+STUDENT (etc.) batch rule from inst.schemas instead of
500ing. Explicit model configs still take precedence.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(models): cast jsonpickle decode for mypy no-any-return

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(databricks): prefer Cloud Run job when pipeline name is ambiguous

When multiple dev bundle jobs match a PDP or legacy inference pipeline
substring, resolve to [dev dev_cloudrun_sa] if present; otherwise use
the first sorted match instead of failing the inference request.
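
The tie-break can be sketched as below. The `[dev dev_cloudrun_sa]` marker comes from the commit message; `pick_pipeline_job` and the sample job names in the usage note are hypothetical:

```python
PREFERRED_MARKER = "[dev dev_cloudrun_sa]"


def pick_pipeline_job(job_names: list[str], pipeline_substring: str) -> str:
    """Choose one job name among substring matches.

    Prefers the Cloud Run service-account deployment; otherwise falls
    back to the first match in sorted order so the choice is deterministic.
    """
    matches = sorted(name for name in job_names if pipeline_substring in name)
    if not matches:
        raise ValueError(f"no job matches {pipeline_substring!r}")
    for name in matches:
        if PREFERRED_MARKER in name:
            return name
    return matches[0]
```

Given matches deployed under `[dev zed]`, `[dev dev_cloudrun_sa]`, and `[dev amy]`, the Cloud Run one is chosen; without it, the `[dev amy]` job wins by sort order.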

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

* ci: datakind shared workflows (#245)

* ci: datakind shared workflows

* refactor: rename test.yml -> tests.yml

* fix(ci): add workflow_call to style and tests workflows

* refactor: use pre-release workflow from shared workflows

* ci: replace with shared enforce-pr-targets workflow

Aligns checks against the current protected branches, main and develop, rather than staging

* chore: remove unused workflow

* refactor: remove pull_request triggers. These run via ci.yml

* ci: pin tests and type-check to Python 3.13

* chore(ci): remove legacy webapp-and-worker precommit workflow

* ci: standardize on Python 3.12 across workflows and pyproject

* ci: test workflow enforcement

* ci: test workflow enforcement

* ci: add gate job to report required ci status check

* chore: bump python version to 3.10

* chore: standardize Python 3.12 across project and Docker

* chore: updating edvise v1.2.0

* chore: CHANGELOG.md update + type check

* chore(release): bump version

* ci(cloudbuild): parameterize webapp deploy for multi-environment triggers

---------

Co-authored-by: Rachel Wells <rachellaurynwells@gmail.com>
Co-authored-by: Vishakh Pillai <64162993+vishpillai123@users.noreply.github.com>
Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>
Co-authored-by: Hannah Ofstedahl <98632391+chapmanhk@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Noreen Mayat <nm3224@alum.barnard.edu>

---------

Co-authored-by: Meshach Ogunmodede <142531479+Mesh-ach@users.noreply.github.com>
Co-authored-by: Hannah Ofstedahl <98632391+chapmanhk@users.noreply.github.com>
Co-authored-by: Hannah Ofstedahl <hannahxchapman@yahoo.com>
Co-authored-by: Mesh <meshach.ogunmodede@datakind.org>
Co-authored-by: William Carr <bill.carr@datakind.org>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: William Carr <bill@datakind.org>
Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>
Co-authored-by: kaylawilding <95330483+kaylawilding@users.noreply.github.com>
Co-authored-by: Rachel Wells <rachellaurynwells@gmail.com>
Co-authored-by: Noreen Mayat <nm3224@alum.barnard.edu>
kaylawilding added a commit that referenced this pull request Jun 29, 2026
* docs: inherit org community health files (#237)

* docs: remove local community health files to inherit from org-wide .github repo

* docs: update README to include previous contributing info

* feat(api): simplify create model request to name only (#238)

* chore: bump edvise dependency to 1.0.0 (#241)

Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>

* chore: creating dummy changelog.md file while we create semver / gitflow process

* feat: trigger Databricks GCS→bronze sync after file validation (Edvise/Legacy) (#239)

* Trigger GCS→bronze Databricks sync after validation (Edvise/Legacy)

- Add run_validated_gcs_to_bronze_sync and job edvise_validated_gcs_to_bronze_sync
  with include_blob_paths_json for validated/{file_name}.
- Call after successful validate-upload / validate-sftp when edvise_id or legacy_id
  is set; ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION (default true) to disable.
- Failures to start the job are logged and do not fail validation.
- Extend data tests with DatabricksControl mock and assertions.

Made-with: Cursor

* feat(data): bronze sync after validation with tracing and job id resolution

Schedule GCS-to-bronze Databricks run_now in BackgroundTasks after validated/
writes. Add correlation_id and JSON trace logs (validation_request, background
start/done). Optional DATABRICKS_VALIDATED_BRONZE_SYNC_JOB_ID; resolve job by
name with duplicate detection when unset. Refine skip reasons for PDP vs
Edvise/Legacy.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(data): align bronze sync with universal principles

Extract Databricks helpers and job-parameter constants, use specific
exceptions (ValueError, DatabricksError), and split background logging
into focused functions under 50 lines. Add tests for PDP-only and env
kill-switch skips plus run_now parameter contract coverage.

Co-authored-by: Cursor <cursoragent@cursor.com>

* style: apply ruff format to bronze sync modules

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(data): resolve prefixed bronze sync Databricks jobs

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(data): map bronze sync job ids by environment

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(data): trigger bronze sync during validation request

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(api): validate Edvise uploads with repo schemas (#242)

* feat(api): validate Edvise uploads with repo schemas

Route Edvise student and course uploads through upstream edvise Pandera schemas so upload validation matches the pipeline contract and is not bypassed by registry schema drift.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(api): separate Edvise validation routing

Keep PDP and Edvise repo-schema upload validation paths distinct so each helper has a single responsibility while preserving the same validation behavior.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(api): remove redundant repo validation fallback

Keep JSON validation flow focused now that PDP and Edvise repo-schema uploads are routed before schema merging.

Co-authored-by: Cursor <cursoragent@cursor.com>

* style(api): format validation routing test

Apply Ruff formatting to keep the Edvise validation routing tests passing style checks.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(webapp): establish pyproject.toml as canonical Edvise API version (#243)

* feat(webapp): establish pyproject.toml as canonical Edvise API version, set version and title in OpenAPI

* docs: rename SST -> Edvise

* docs(webapp): update formatter instructions to ruff to align with github workflows and engineering playbook

* test(webapp): assert OpenAPI version matches pyproject.toml

* feat(eda): add clear_cache option to /eda endpoint (#233)

* feat: legacy school inference DB job trigger (#212)

* feat: custom school inference, but need to confirm if custom is the same as legacy

* fix: transitioning from 'custom' to 'legacy'

* fix: remove validation of job parameters, handled already through edvise

* fix: run request still requires str values, defaulting to empty string

* fix: still getting pydantic error

* feat: using substring matching to find legacy job since I have it deployed under my name because of target==dev

* fix: style

* fix: style

* fix: style

* fix: making batch file name more robust so we don't run into decoding issues

* fix: merge conflict

* fix: merge conflict

---------

Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>

* feat: add gen ai as 4th option in addition to pdp / edvise / legacy in api and uploads (#244)

* feat: Added "GenAI" as an option for "create institution"

note: for GenAI raw files, we will reuse the same loose rules as Legacy institutions (read CSV, PII check, no strict ES columns).

* fix: style

---------

Co-authored-by: Noreen Mayat <nm3224@alum.barnard.edu>
Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>

* fix(models): derive PDP batch schema configs from institution schemas (#247)

* fix(models): derive PDP batch schema configs from institution schemas

When model.schema_configs is null, PDP inference now builds a default
required COURSE+STUDENT (etc.) batch rule from inst.schemas instead of
500ing. Explicit model configs still take precedence.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(models): cast jsonpickle decode for mypy no-any-return

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(databricks): prefer Cloud Run job when pipeline name is ambiguous

When multiple dev bundle jobs match a PDP or legacy inference pipeline
substring, resolve to [dev dev_cloudrun_sa] if present; otherwise use
the first sorted match instead of failing the inference request.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

* ci: datakind shared workflows (#245)

* ci: datakind shared workflows

* refactor: rename test.yml -> tests.yml

* fix(ci): add workflow_call to style and tests workflows

* refactor: use pre-release workflow from shared workflows

* ci: replace with shared enforce-pr-targets workflow

Aligns checks against the current protected branches, main and develop, rather than staging

* chore: remove unused workflow

* refactor: remove pull_request triggers. These run via ci.yml

* ci: pin tests and type-check to Python 3.13

* chore(ci): remove legacy webapp-and-worker precommit workflow

* ci: standardize on Python 3.12 across workflows and pyproject

* ci: test workflow enforcement

* ci: test workflow enforcement

* ci: add gate job to report required ci status check

* chore(release): sync develop with main (#251)

* Merge pull request #217 from datakind/fix/pdp-course-handling-duplicates

fix: fix duplicate-handling step in validation

* fix(storage): reduce peak memory during upload validation

- Download unvalidated blob to a temp file and validate by path instead of
  blob.open().read() via _path_for_edvise_read (avoids a full in-RAM copy).
- Write validated CSV to a temp file and upload_from_filename instead of
  building the entire CSV in a StringIO string.

Branched from develop (repo has no dev branch).

Made-with: Cursor

* chore(storage): log errno on temp download/to_csv OSError

Helps distinguish ENOSPC vs other failures in Cloud Run logs; re-raises unchanged.

Made-with: Cursor

* test(storage): cover temp cleanup and OSError logging for validate upload

- Download OSError: unlink temp, skip validate_file_reader, log errno
- to_csv OSError: unlink temp, no upload, log errno
- Upload failure after to_csv: temp still unlinked

Made-with: Cursor

* refactor(storage): extract temp download/unlink helpers for clarity

Aligns with universal-principles: keep _run_validation_and_get_normalized_df
under 50 lines, reduce nesting, replace tmp_path with local_csv_path naming.

Made-with: Cursor

* style: apply black/ruff format to gcsutil_test.py

Made-with: Cursor

* feat: consolidating staging into main and using main going forward as production (#234)

* Feat: Added backfill endpoint

* Fix: linting

* added func description

* added func description

* added func description

* added func description

* added func description

* added func description

* added func description

* feat: adjusted run output endpoint to return model_run_id

* Delete .DS_Store

* Delete src/.DS_Store

* Delete terraform/.DS_Store

* feat: added model deletion endpoint

* feat: added model deletion endpoint

* feat: added model deletion endpoint

* fix: linting

* fix: linting

* fix: linting

* fix: linting

* fix: linting

* fix: linting

* fix: linting

* fixed model name malformation

* fix: removed databricks deletion functionality

* fix: removed query results not needed

* fix: removed query results not needed

* fix: added status

* fix: added status

* fix: formatting fix

* fix: added query to retrieve model id

* fix: added passive delete to db cascade so deleting the model ensures job runs are deleted

* fix: removed extra db query for model id, since db now handles passive deletes

* fix: formatting fix

* fix: removed db mapping framework

* fix: removed db mapping framework

* fix: removed db mapping framework

* fix: removed db mapping framework

* feat: changed endpoint parameter name from experiment_run_id -> model_run_id

* fix: type check errors

* test batch and file data

* eda endpoints

* test data

* eda calculations

* eda year and term, course enrollments

* eda degree types

* fix: divide data category into a separate front-end table section

* fix: linting

* feat: developed function for adding custom jobs with institution and model validation

* fix: linting errors

* fix: linting errors

* fix: changed route from GET to POST

* fix: added output filename definition

* fix: linting errors

* eda test institution data

* eda test institution

* eda data

* eda test data

* allow missing eda data

* eda enrollment type by intensity

* eda pell recipient by 1st gen

* eda student age by gender

* eda pell status by race

* eda tests

* cache eda

* tidy up

* remove LOCAL test bucket setup

* return List from get_term_counts

* import pandas

* remove unused variable

* tidy up

* eda bucket names

* fix: type check errors

* fix: type check errors

* fix: type check errors

* fix: formatting errors

* fix: type check errors

* fix: type check errors

* fix: batch name renewal

* fix: batch name renewal

* fix: changed output_valid to true

* fix: adjusted model card file path

* fix: ensuring we are grabbing the most recent run for a model id

* remove colors from /eda endpoint

* return count and percentage in /eda degree_types

* tidy up

* fix: fix file format

* fix: retrieve by model_run_id instead

* fix: formatting

* fix: validation error for worwic

* fix: changed model name to model_run_id parameter

* fix: added function to retrieve config.toml from select catalog

* manually initialized course mappings

* feat: added validation mapping

* fix: formatting

* fix: pylint

* Ignore .cursor folder for personal cursor preferences

* feat(schema): add Edvise schema definition

* feat(institutions): add Edvise schema support

Add Edvise schema support to institution management:

- Add edvise_id field to InstTable and SchemaRegistryTable

- Update create/update endpoints with Edvise support and validation

- Add mutual exclusivity check (PDP vs Edvise)

- Implement normalization for empty strings and whitespace

- Remove redundant boolean flags (derive status from ID presence)

- Add comprehensive test coverage (34 new test cases)

All changes are backward compatible.

* fix: resolve CI/CD test failures

- Fix test_create_inst_with_edvise_success: use unique institution name to avoid UNIQUE constraint violation
- Fix test_trigger_inference_run: add pdp_id to InstTable fixture in models_test.py
- Fix code formatting: run ruff format on database.py, institutions.py, and institutions_test.py

These fixes address the three issues that were causing CI/CD test failures:
1. UNIQUE constraint failed: inst.name in test_create_inst_with_edvise_success
2. Assertion error: expected 400 but got 501 in test_trigger_inference_run
3. Ruff format check failures

* fix: resolve unique constraint conflicts in SchemaRegistryTable

- Add doc_type to is_pdp and is_edvise unique constraints to allow base, PDP, and Edvise schemas to coexist with same version
- Add CheckConstraint to enforce mutual exclusivity of is_pdp and is_edvise flags

Fixes Bugbot issue: Unique constraint prevented coexisting schema types for same version. The original constraints (is_pdp, version_label) and (is_edvise, version_label) prevented base schema and PDP/Edvise extensions from sharing the same version label since they all had is_pdp=False and is_edvise=False. Adding doc_type to these constraints allows proper coexistence while maintaining uniqueness guarantees.

Also adds database-level enforcement that is_pdp and is_edvise cannot both be True simultaneously.

* fix: resolve mypy type errors

- Fix type error in institutions.py: change set to list for requested_schemas default value
- Add return type annotations to all test functions in institutions_test.py
- Add return type annotations to fixture functions
- Add typing.Any import for fixture return types

Fixes mypy errors: incompatible types in assignment and missing return type annotations.

* fix: add missing type annotations to test function parameters

- Add TestClient type annotations to test_create_inst_unauth, test_create_inst, test_edit_inst, and test_delete_inst

Fixes mypy errors: Function is missing a type annotation for one or more arguments.

* feat: Implement Phase 3 Edvise schema validation logic

- Add EDVISE_SCHEMA_GROUP constant to utilities.py (mirrors PDP_SCHEMA_GROUP)
- Add _edvise_cache to _ValidationState class for schema caching with TTL
- Update validation_helper() to load Edvise schema when edvise_id is set
- Add defensive check for mutual exclusivity (pdp_id and edvise_id cannot both be set)
- Add error handling for missing Edvise schema with clear error messages
- Update institution creation endpoint to use EDVISE_SCHEMA_GROUP when edvise_id is provided
- Add comprehensive test suite: 15 tests covering happy path, errors, cache, authorization, and edge cases

This implementation enables institutions with edvise_id to use the Edvise schema extension
for file validation, following the same pattern as PDP schema validation. All changes are
backwards compatible and include comprehensive test coverage (~90% of critical paths).
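
The "schema caching with TTL" idea mentioned above can be illustrated with a small time-based cache. This is a sketch under assumptions, not the `_ValidationState` implementation: the class name `TTLCache`, its methods, and the injectable `clock` are all hypothetical.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny time-based cache; an entry expires ttl_seconds after it is stored."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            # Expired: drop the entry so the caller reloads the schema.
            del self._entries[key]
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock(), value)
```

A caller would check `get("edvise")` first and only hit the database when it returns `None`, then store the loaded schema with `set`.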

* fix: Resolve Edvise test failures and improve test reliability

- Fix type annotation error in PDP schema branch (mypy no-redef)
- Change test user to DATAKINDER for multi-institution access
- Fix database constraint violation in precedence test (version_label)
- Simplify cache tests to verify behavior instead of implementation
- Remove duplicate assertion in cache expiration test
- Optimize imports in test fixture

* fix: Update Edvise test filenames to include descriptive keywords

- Change generic test filenames (test.csv, test_file.csv, etc.) to include 'student' keyword
- This allows validation_helper to properly infer model types from filenames
- Fixes ValueError: Could not infer model(s) from file name errors
- Formatting will be applied by CI ruff formatter

* style: Format data_test.py with ruff

* fix(validation): return proper HTTP status codes for institution errors

- Change ValueError to HTTPException (404) when institution not found in validation_helper
- Fix test_validate_edvise_unauthorized to test actual unauthorized access instead of non-existent institution
- Ensures proper HTTP status codes are returned to API clients

* fix: handle filename inference errors and extension schema deactivation

- Replace ValueError with HTTPException (422) for filename inference failures
  to return proper user-facing error instead of 500
- Deactivate existing extension schemas before inserting new ones to ensure
  only one active extension per institution and prevent nondeterministic queries
- Add comprehensive validation error formatter with PII masking and user-friendly messages
- Add integration and snapshot tests for error formatter

* fix: remove unused imports from validation_error_formatter_snapshot_test

- Remove unused typing imports (Any, Dict, List)
- Remove unused pandera imports (DataFrameSchema, Column, Check)
- Remove unused MAX_ERROR_EXAMPLES import

Fixes ruff linting errors (F401) reported in CI.

* fix: resolve test failures and configuration issues

- Remove invalid catalog_name parameter from create_custom_schema_extension call
- Restore testpaths configuration to use src directory
- Add Pandera FutureWarning filter to pytest config
- Fix syntax warning in databricks.py docstring
- Format files with Ruff

* fix: resolve Ruff and Mypy linting errors

- Remove unused imports (IO, cast, tomli/tomllib) from databricks.py
- Remove duplicate import re statement
- Add type annotations to test cases in validation_error_formatter_test.py
- Add type: ignore comments for intentional invalid type tests

* fix: align database constraints with production schema and fix Edvise version_label collision

- Fix uq_pdp_version constraint to match production: remove doc_type (matches actual DB schema)
- Remove uq_edvise_version constraint (enforced operationally, not via DB constraint)
- Update CHECK constraint to use MySQL-compatible boolean values (1/0 instead of TRUE/FALSE)
- Fix Edvise test fixture to use version_label='edvise-1.0.0' to avoid uq_pdp_version collision
- Add explanatory comment about version_label choice in test fixture

These changes ensure the ORM matches the actual production database schema and prevent
constraint violations when running tests against MySQL.

* fix: handle parameterized Pandera check types in validation error formatting

Fix bug where parameterized check types (e.g., "isin(['A', 'B', 'C'])",
"str_length(3, None)") were not being matched to their formatters, causing
generic error messages instead of human-readable ones.

Changes:
- Add _extract_base_check_type() to extract base type from parameterized
  check types (e.g., "isin(['A', 'B'])" -> "isin")
- Add _normalize_check_type_alias() to map verbose Pandera names to spec
  keys (e.g., "greater_than" -> "gt", "greater_than_or_equal_to" -> "ge")
- Update _find_check_spec() to use base type extraction and alias normalization
- Update _format_check_error() to only format when matching spec is found
  (prevents semantic errors like formatting "greater_than" as "ge")
- Add _format_gt_error() and _format_lt_error() for strict comparison checks
- Preserve semantic correctness: strict comparisons (> and <) vs non-strict (≥ and ≤)

Edge cases handled:
- Namespaced types: "Check.isin(['A'])" -> "isin"
- Empty/None/non-string inputs: returns safe empty string
- Spaces around parentheses: "isin (['A'])" -> "isin"
- Complex repr: "str_matches(re.compile('...'))" -> "str_matches"

Testing:
- Add comprehensive unit tests for base type extraction and alias handling
- Add tests for parameterized check types (isin, str_length, gt, ge)
- Update integration test assertion to match actual output format
- Update snapshot fixtures to reflect new human-readable messages

Fixes parameterized check type matching while maintaining semantic
correctness for strict vs non-strict comparisons.
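
The base-type extraction and alias normalization described in this commit can be sketched as below. The function names here drop the leading underscore and the bodies are an assumption about how such helpers might work; the alias table contains only the two mappings named in the commit message.

```python
# Hypothetical alias table: only the mappings mentioned in the commit message.
_ALIASES = {"greater_than": "gt", "greater_than_or_equal_to": "ge"}


def extract_base_check_type(check: object) -> str:
    """Return the bare check name, e.g. "isin(['A', 'B'])" -> "isin"."""
    if not isinstance(check, str) or not check.strip():
        return ""  # empty, None, or non-string input yields a safe empty string
    # Keep everything before the first "(" and trim spaces around it.
    head = check.split("(", 1)[0].strip()
    # Drop a namespace prefix such as "Check." if one is present.
    return head.rsplit(".", 1)[-1].strip()


def normalize_check_type(check: object) -> str:
    """Map verbose names to short spec keys, leaving unknown names unchanged."""
    base = extract_base_check_type(check)
    return _ALIASES.get(base, base)
```

Because strict and non-strict comparisons map to different keys (`gt` versus `ge`), a formatter looked up by the normalized key cannot describe a strict check with non-strict wording.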

* style: format validation_error_formatter files with ruff

Auto-formatted files to comply with project formatting standards.

* feat: add case-insensitive institution name lookup

- Implement case-insensitive matching for GET /institutions/name/{inst_name} endpoint
- Use func.lower() on both database column and input parameter for case-insensitive comparison
- Update docstring to document case-insensitive behavior and error handling
- Add comprehensive test cases for case-insensitive matching:
  - Test multiple case variations (original, title case, uppercase, mixed case)
  - Test lowercase input matching database entries
  - Test uppercase input matching lowercase database entries
- Fix type error: change requested_schemas assignment from set to list for type consistency
- Apply code formatting with ruff
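
The case-insensitive comparison can be illustrated without SQLAlchemy. The sketch below uses the standard-library `sqlite3` module and SQL `LOWER()` in place of `func.lower()`; the table layout and the function name `find_inst_by_name` are assumptions for illustration only.

```python
import sqlite3
from typing import Optional, Tuple


def find_inst_by_name(conn: sqlite3.Connection, name: str) -> Optional[Tuple[int, str]]:
    """Look up an institution by name, ignoring case on both sides of the comparison."""
    return conn.execute(
        # Lower-casing the column and the parameter makes the match case-insensitive.
        "SELECT id, name FROM inst WHERE LOWER(name) = LOWER(?)",
        (name,),
    ).fetchone()
```

Note that SQLite's built-in `LOWER()` folds ASCII letters only; other databases may behave differently depending on collation.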

* fix: add missing return type annotations to test functions

- Add Generator import from typing for fixture return types
- Add return type annotations (-> None) to all test functions:
  - test_read_all_inst
  - test_read_all_inst_datakinder
  - test_read_inst_by_name
  - test_read_inst_by_name_case_insensitive
  - test_read_inst_by_name_case_insensitive_lowercase
  - test_read_inst_by_name_case_insensitive_uppercase
  - test_read_inst_by_pdp_id
  - test_read_inst
- Fix fixture return types to use Generator[TestClient, None, None]
  - client_fixture
  - datakinder_client_fixture
- Resolves mypy type checking errors for test file

* style: apply ruff formatting to test file

- Split long function signatures across multiple lines for readability
- Format client_fixture and datakinder_client_fixture function signatures
- Format test_read_inst_by_name_case_insensitive_lowercase and _uppercase function signatures

* fix(test): update institutions test for edvise_id API changes

- Remove unused typing.Any import
- Update test_read_all_inst_datakinder to include edvise_id in expected response
- Add edvise_test_school institution to expected response (4 institutions total)
- Fix line length for pylint compliance

This fixes test failures caused by API changes from develop branch that now
return edvise_id and pdp_id fields for all institutions.

* fix(validation): pass institution_id so Edvise/PDP/custom use correct extension block

- Thread schema_namespace (edvise | pdp | inst UUID) from data router through
  validate_file and validate_file_reader into validate_dataset
- merge_model_columns now receives correct key for extension_schema['institutions']
- Add institution_id param with default 'pdp' for backward compatibility
- Add tests: assert Edvise validation passes institution_id='edvise'; add unit test
  that institution_id selects the right extension block (edvise vs pdp)
- Expand docstrings (Args/Returns) and add comment explaining schema_namespace
- Addresses reviewer Q1: schema extension logic now works for Edvise and custom
  institutions, not only PDP

* Apply Black formatting to institutions_test.py

* Apply ruff format to institutions_test.py

* Fix institutions_test assert for Black and Ruff format compatibility

* Fix pylint E1135 in data_test: use .get() instead of membership test on captured_schema

* Apply ruff format to data_test.py

* feat(validation): schema validation during upload with PDP/edvise repo alignment

- Add PDP edvise schema validation path (validation_pdp_edvise)
- Add Edvise-to-PDP normalization (validation_edvise_normalize)
- Integrate repo schemas into validation pipeline and error formatter
- Update pdp_schema_extension and lockfile; add tests

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(validation): write normalized data to validated/, archive raw to raw/

- On validation success: archive original to raw/{filename}, write
  normalized (canonical columns, coerced dtypes) DataFrame to
  validated/{filename}, delete from unvalidated/
- Validation layer always returns normalized_df on success; storage
  serializes to UTF-8 CSV and uploads to validated/
- Add input validation and helpers in gcsutil (under 50 lines); catch
  specific exceptions; TYPE_CHECKING for HardValidationError in
  validation_pdp_edvise
- Add gcsutil_test.py: validate_file input/error/success paths,
  _run_validation_and_get_normalized_df, _write_dataframe_to_gcs_as_csv
- Add validation_test: empty-schema short-circuit returns normalized_df None
- Ruff/black formatting and lint fixes; mypy-clean for touched files

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(validation): align with universal principles, add tests, fix types and format

- Extract validation helpers to meet 50-line rule (_header_missing_and_extra,
  _get_csv_read_kwargs, _validate_optional_columns_json)
- Extract gcsutil._archive_raw_and_write_validated; add type hints to rename_file
- Add tests: PDP rename/validate_dataframe, CSV read failure, gcsutil error
  propagation, edvise institution_identifier in validate_file call
- Remove unused validation_edvise_normalize and its tests
- Fix mypy in validation_pdp_edvise and tests (Optional[List], cast, annotations)
- Apply ruff format

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(validation): use edvise read for PDP uploads and add PDP path tests

- Route PDP cohort/course through edvise read (read_raw_pdp_*); remove
  API-side normalizers for PDP so pipeline and API share one source of truth
- Add _path_for_edvise_read, _read_pdp_course_edvise, _validate_pdp_with_edvise_read
- Convert Pandera SchemaErrors to HardValidationError in PDP path
- Add validation_pdp_read_path_test.py (routing, path cleanup, SchemaErrors,
  course converter fallback); extend Src type with io.StringIO for file-like

Co-authored-by: Cursor <cursoragent@cursor.com>

* move cloud build config to repo

* sst-app-api -> edvise-api

* quiet down sqlalchemy

* use EdaSummary from edvise

* use ruff formatter

* test a file

* tidy up

* Add return type annotations for mypy in main_test and users_test

* tidy up

* move cache check after batch result check

* fix test_execute_pdp_pull

* install git

* install git in correct Dockerfile

* install git in worker

* update edvise branch

* use develop branch for edvise

* install edvise in build

* cloudbuild with edvise

* fix(validation): resolve pylint used-before-assignment error

Initialize schema_err_to_raise before try block to satisfy pylint's
static analysis, which doesn't recognize that pytest.skip() always raises.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(api): add legacy school type with any-format uploads

- Add legacy_id to InstTable and institution API (create, update, read)
- Enforce mutual exclusivity of pdp_id, edvise_id, legacy_id via has_at_most_one_school_type
- Legacy validation: encoding + CSV read only, no schema checks
- Add LEGACY_SCHEMA_GROUP and tests for legacy path and mutual exclusivity

Made-with: Cursor
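
A mutual-exclusivity check of this kind might look like the sketch below. Only the function name comes from the commit message; the signature and the choice to treat `None`, empty, and whitespace-only strings as "not set" are assumptions.

```python
from typing import Optional


def has_at_most_one_school_type(
    pdp_id: Optional[str],
    edvise_id: Optional[str],
    legacy_id: Optional[str],
) -> bool:
    """Return True when zero or one of the three school-type ids is set."""
    ids = (pdp_id, edvise_id, legacy_id)
    # Assumption: blank or whitespace-only strings count as unset.
    set_count = sum(1 for value in ids if value is not None and value.strip())
    return set_count <= 1
```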

* feat(api): legacy PII check, principles compliance, and test coverage

- Add PII column check for legacy uploads; reject before raw/validated
- Treat student_id as non-PII (false positive) for all institution types
- Comply with universal principles: docstrings, extract create_institution
  helpers (<50 lines), comment lazy import in validation
- Add tests: has_at_most_one_school_type, legacy header-only CSV,
  legacy PII rejection returns 400, explicit legacy_id create,
  update add legacy_id, storage/Databricks failure paths
- Fix mypy in create_institution (row variable)

Made-with: Cursor

* docs(api): use Edvise Schema (ES) naming to reduce confusion

Replace 'Edvise schema' with 'Edvise Schema (ES)' in docstrings,
comments, and user-facing error messages so the schema type is
distinguished from the Edvise product (ES convention).

Made-with: Cursor

* feat(data): allow legacy institutions to upload files with any filename

- Fetch institution before filename inference; set allowed_schemas to UNKNOWN when
  inference fails for legacy (non-legacy still get 422 for non-descriptive names)
- Refactor validation_helper into helpers under 50 lines; add full docstrings,
  early empty-filename and invalid inst_id validation, log before 404
- Add unit tests for _infer_allowed_schemas_from_filename and _ext_models_set
- Add integration tests: empty filename 422, invalid inst_id 404, edvise
  non-descriptive filename 422, duplicate validate idempotent
- Fix mypy and ruff/black in data.py and data_test.py
- Add PR_DESCRIPTION.md for feature branch

Made-with: Cursor

* chore: remove PR_DESCRIPTION.md

Made-with: Cursor

* fix(validation): run PII check for header-only legacy CSVs

* fix(test): align validation error snapshot with non-PII student_id display

Made-with: Cursor

* feat(validation): use PDP cohort converter and support custom converters

- Use converter_func_cohort by default for PDP cohort validation (filters DE/DS/SE)
- Add optional pdp_cohort_converter_func and pdp_course_converter_func to
  validate_file_reader and validate_dataset for school-specific overrides
- Course validation tries custom converter first, then default handling_duplicates
- Validate converter args are callable; convert converter/read failures to
  HardValidationError so API returns 400 with context
- Add PDPConverterFunc type; extract helpers to meet 50-line and error-handling rules

Made-with: Cursor

* fix(validation): satisfy mypy for PDP validation and tests

- Add unreachable return after with block in _validate_pdp_with_edvise_read
- Use cast(Any, ...) in tests that pass non-callables to converter params

Made-with: Cursor

* chore: remove real institution names

* chore: ruff format

* fix: use latest edvise EdaSummary

* fix: use edvise develop branch

* chore(deps): pin edvise to develop

* feat(ci): notify slack channel on deployment

* fix: lock file was out of sync

* chore: bump edvise version to 0.1.12

* Revert "feat: legacy school type with any-format uploads, PII check, and Edvise Schema (ES) naming"

* Revert "Revert "feat: legacy school type with any-format uploads, PII check, and Edvise Schema (ES) naming""

* feat(config): add optional local inst/batch/file seed from config for LOCAL

* style: ruff format

* fix(validation): pass schema_type to handling_duplicates for PDP course CSV

read_raw_pdp_course_data calls converter_func(df) with one argument; bare
handling_duplicates is invalid on current edvise. Use a wrapper that calls
handling_duplicates(df, "pdp") positionally for edvise compatibility.

Remove the broken second default converter. Update PDP read path test.

Made-with: Cursor
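
The wrapper fix can be sketched as follows. Both `handling_duplicates` and `read_course_rows` here are simplified stand-ins written for this example (plain lists of dicts instead of DataFrames); they only mimic the calling convention described in the commit, not the real edvise behavior.

```python
from typing import Callable, Dict, List

Rows = List[Dict[str, str]]


def handling_duplicates(rows: Rows, schema_type: str) -> Rows:
    """Stand-in only: drops exact duplicate rows. The real edvise function differs."""
    seen = set()
    unique: Rows = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def read_course_rows(rows: Rows, converter_func: Callable[[Rows], Rows]) -> Rows:
    """Stand-in reader: calls converter_func with exactly one argument."""
    return converter_func(rows)


def pdp_course_converter(rows: Rows) -> Rows:
    # Adapter: supplies schema_type positionally so the reader keeps its one-argument call.
    return handling_duplicates(rows, "pdp")
```

Passing bare `handling_duplicates` to the reader would raise `TypeError` because the second argument is missing, which is the failure the wrapper avoids.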

* style: ruff format PDP course read path test

Made-with: Cursor

* fix(deps): upgrade databricks-sql-connector for pyarrow>=17 (edvise)

databricks-sql-connector 3.5 pins pyarrow<17; edvise requires pyarrow>=17.
Use databricks-sql-connector[pyarrow]~=4.2.x and refresh uv.lock (pyarrow 19).

Aligns lock with Cloud Build 'uv lock --upgrade-package edvise'.

Made-with: Cursor

* Merge pull request #217 from datakind/fix/pdp-course-handling-duplicates

fix: fix duplicate-handling step in validation

* fix(storage): reduce peak memory during upload validation

- Download unvalidated blob to a temp file and validate by path instead of
  blob.open().read() via _path_for_edvise_read (avoids a full in-RAM copy).
- Write validated CSV to a temp file and upload_from_filename instead of
  building the entire CSV in a StringIO string.

Branched from develop (repo has no dev branch).

Made-with: Cursor

* chore(storage): log errno on temp download/to_csv OSError

Helps distinguish ENOSPC vs other failures in Cloud Run logs; re-raises unchanged.

Made-with: Cursor

* test(storage): cover temp cleanup and OSError logging for validate upload

- Download OSError: unlink temp, skip validate_file_reader, log errno
- to_csv OSError: unlink temp, no upload, log errno
- Upload failure after to_csv: temp still unlinked

Made-with: Cursor

* refactor(storage): extract temp download/unlink helpers for clarity

Aligns with universal-principles: keep _run_validation_and_get_normalized_df
under 50 lines, reduce nesting, replace tmp_path with local_csv_path naming.

Made-with: Cursor

* style: apply black/ruff format to gcsutil_test.py

Made-with: Cursor

* chore: bump edvise v0.2.0

* fix(pdp-validation): default cohort converter to none

Stop passing edvise converter_func_cohort when pdp_cohort_converter_func is omitted so PDP cohort rows are validated as read.

- Callers may still pass an explicit cohort converter.

- Update PDP read-path test to expect converter_func=None.

- Refresh docstrings (pipeline vs API, Args/Returns/Raises) in validation and validation_pdp_edvise.

Made-with: Cursor

* feat(api): remove custom institution path; require school type; legacy schemas UNKNOWN

- Require exactly one of PDP, Edvise, or Legacy on POST /institutions
- Remove custom schema resolution and Databricks extension generation for uploads
- Fix PATCH /institutions to persist allowed_schemas to inst.schemas column
- LEGACY_SCHEMA_GROUP stores UNKNOWN only; drop validation_extension module
- Update tests and default fixtures for typeless/custom removal

Made-with: Cursor

* feat(api): harden institutions API after custom-institution removal

- POST/PATCH: require exactly one school type (pdp, edvise, or legacy)
- PATCH: recompute schemas only when the type triple changes; merge optional allowed_schemas on change
- PATCH: honor is_edvise/is_legacy for auto-assigned ids (POST parity)
- Docs/tests: validation namespaces; disambiguate custom naming in code and tests

Made-with: Cursor

* docs(api): revert broad custom wording; keep upload docs accurate

Restore original docstrings and test names where "custom" referred to
converters, schema config, or JSON keys, not custom institutions.

Keep gcsutil validate_file institution_id line aligned with pdp/edvise/legacy
only (no institution-UUID-for-custom upload path).

Made-with: Cursor

* fix(institutions): reject POST duplicate when existing row lacks school type

When (name, state) matches an existing InstTable row, validate stored
pdp_id/edvise_id/legacy_id the same as new creates: at most one non-null
and exactly one required. Return 400 with guidance instead of 200 for
typeless or invalid rows. Add regression tests.

Made-with: Cursor

* test(institutions): cover duplicate POST, PATCH flags, allowed_schemas-only

- Reject is_pdp without pdp_id on POST
- Reject duplicate (name, state) when stored row has conflicting ids
- Reject PATCH is_edvise on PDP row without clearing pdp_id
- Reject PATCH with both is_edvise and is_legacy
- allowed_schemas-only PATCH replaces schemas when type unchanged

Made-with: Cursor

* refactor(institutions): extract PATCH helpers and DRY school-type errors

- Add shared mutual-exclusion detail constant for POST/PATCH paths
- Extract duplicate-post row validation and PATCH merge/validate/persist helpers
- Keep update_inst within single-responsibility helpers; reuse row response mapper

Made-with: Cursor

* fix(lint): satisfy ruff and mypy on databricks and institutions

- Remove unused HTTPException import from databricks.py (F401)
- Cast ORM row in _require_single_institution_row_by_uuid for InstTable (no-any-return)

Made-with: Cursor

* style(institutions): apply ruff format to router and tests

Made-with: Cursor

* refactor: simplify local_inst_data

* docs: Update local_inst_data instructions

* chore: remove unused import

* fix: make pdp_id and state optional

* chore: bumping pyproject and uv.lock

---------

Co-authored-by: Mesh <meshach.ogunmodede@datakind.org>
Co-authored-by: Meshach Ogunmodede <142531479+Mesh-ach@users.noreply.github.com>
Co-authored-by: William Carr <bill.carr@datakind.org>
Co-authored-by: Hannah Ofstedahl <98632391+chapmanhk@users.noreply.github.com>
Co-authored-by: Hannah Ofstedahl <hannahxchapman@yahoo.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: William Carr <bill@datakind.org>
Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>
Co-authored-by: kaylawilding <95330483+kaylawilding@users.noreply.github.com>

* Revert "feat: consolidating staging into main and using main going forward as…" (#236)

This reverts commit 9b70f238333796f7d9835d5ac5e1c81ee66d11c6.

* Merge develop into main (#240)

* docs: inherit org community health files (#237)

* docs: remove local community health files to inherit from org-wide .github repo

* docs: update README to include previous contributing info

* feat(api): simplify create model request to name only (#238)

* chore: bump edvise dependency to 1.0.0 (#241)

Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>

* chore: creating dummy changlog.md file while we create semver / gitflow process

---------

Co-authored-by: Rachel Wells <rachellaurynwells@gmail.com>
Co-authored-by: William Carr <bill@datakind.org>
Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>

* chore(release): edvise-api 1.0.0 (#249)

* docs: inherit org community health files (#237)

* docs: remove local community health files to inherit from org-wide .github repo

* docs: update README to include previous contributing info

* feat(api): simplify create model request to name only (#238)

* chore: bump edvise dependency to 1.0.0 (#241)

Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>

* chore: creating dummy changlog.md file while we create semver / gitflow process

* feat: trigger Databricks GCS→bronze sync after file validation (Edvise/Legacy) (#239)

* Trigger GCS→bronze Databricks sync after validation (Edvise/Legacy)

- Add run_validated_gcs_to_bronze_sync and job edvise_validated_gcs_to_bronze_sync
  with include_blob_paths_json for validated/{file_name}.
- Call after successful validate-upload / validate-sftp when edvise_id or legacy_id
  is set; ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION (default true) to disable.
- Failures to start the job are logged and do not fail validation.
- Extend data tests with DatabricksControl mock and assertions.

Made-with: Cursor

* feat(data): bronze sync after validation with tracing and job id resolution

Schedule GCS-to-bronze Databricks run_now in BackgroundTasks after validated/
writes. Add correlation_id and JSON trace logs (validation_request, background
start/done). Optional DATABRICKS_VALIDATED_BRONZE_SYNC_JOB_ID; resolve job by
name with duplicate detection when unset. Refine skip reasons for PDP vs
Edvise/Legacy.

Co-authored-by: Cursor <cursoragent@cursor.com>
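
The "log and continue" behavior with trace logging can be sketched roughly like this. The function `trigger_bronze_sync`, its parameters, and the `run_now` callback are hypothetical; only the event name, the `outcome` values, and `correlation_id` are taken from the PR description. `RuntimeError` stands in for the Databricks SDK error type so the example needs no third-party imports.

```python
import json
import logging
import uuid
from typing import Any, Callable, Dict, Optional

logger = logging.getLogger("bronze_sync")


def trigger_bronze_sync(
    run_now: Callable[[str], int],
    *,
    edvise_id: Optional[str],
    legacy_id: Optional[str],
    file_name: str,
    correlation_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Try to start the sync job; never raise, and emit one JSON trace record."""
    record: Dict[str, Any] = {
        "event": "gcs_bronze_sync_background_done",
        "correlation_id": correlation_id or uuid.uuid4().hex,
        "file_name": file_name,
    }
    if not (edvise_id or legacy_id):
        record["outcome"] = "skipped"  # PDP-only institutions are not synced
    else:
        try:
            # Assumption: run_now takes the validated blob path and returns a run id.
            record["run_id"] = run_now(f"validated/{file_name}")
            record["outcome"] = "success"
        except (ValueError, RuntimeError) as exc:
            # A trigger failure is recorded but does not fail the validation response.
            record["outcome"] = "trigger_failed"
            record["error"] = str(exc)
    logger.info(json.dumps(record))
    return record
```

Carrying the same `correlation_id` in every record is what allows the request log and the background log to be joined later.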

* refactor(data): align bronze sync with universal principles

Extract Databricks helpers and job-parameter constants, use specific
exceptions (ValueError, DatabricksError), and split background logging
into focused functions under 50 lines. Add tests for PDP-only and env
kill-switch skips plus run_now parameter contract coverage.

Co-authored-by: Cursor <cursoragent@cursor.com>

* style: apply ruff format to bronze sync modules

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(data): resolve prefixed bronze sync Databricks jobs

Co-authored-by: Cursor <cursoragent@cursor.com>
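
One way to picture the job resolution order (explicit ID override, then exact name, then a unique prefixed name) is the sketch below. It is an illustration under assumptions: `resolve_job_id` and the `jobs_by_id` mapping are hypothetical, a "prefixed" job is assumed to be one whose name ends with the base job name, and the DEV/STAGING ID mapping added in a later commit is left out.

```python
import os
from typing import Mapping, Optional

JOB_NAME = "edvise_validated_gcs_to_bronze_sync"
JOB_ID_ENV = "DATABRICKS_VALIDATED_BRONZE_SYNC_JOB_ID"


def resolve_job_id(jobs_by_id: Mapping[int, str], env: Optional[Mapping[str, str]] = None) -> int:
    """Pick the sync job id: env override first, then exact name, then unique prefixed name."""
    env = os.environ if env is None else env
    override = env.get(JOB_ID_ENV, "").strip()
    if override:
        return int(override)

    exact = [job_id for job_id, name in jobs_by_id.items() if name == JOB_NAME]
    if len(exact) == 1:
        return exact[0]
    if len(exact) > 1:
        raise ValueError(f"multiple jobs named {JOB_NAME!r}: {sorted(exact)}")

    # Assumption: bundle deployments prepend text, so the base name is a suffix.
    prefixed = [job_id for job_id, name in jobs_by_id.items() if name.endswith(JOB_NAME)]
    if len(prefixed) == 1:
        return prefixed[0]
    raise ValueError(f"expected exactly one job ending in {JOB_NAME!r}, found {len(prefixed)}")
```

Raising on ambiguity rather than picking the first match keeps the API from silently triggering another developer's copy of the job.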

* fix(data): map bronze sync job ids by environment

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(data): trigger bronze sync during validation request

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(api): validate Edvise uploads with repo schemas (#242)

* feat(api): validate Edvise uploads with repo schemas

Route Edvise student and course uploads through upstream edvise Pandera schemas so upload validation matches the pipeline contract and is not bypassed by registry schema drift.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(api): separate Edvise validation routing

Keep PDP and Edvise repo-schema upload validation paths distinct so each helper has a single responsibility while preserving the same validation behavior.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(api): remove redundant repo validation fallback

Keep JSON validation flow focused now that PDP and Edvise repo-schema uploads are routed before schema merging.

Co-authored-by: Cursor <cursoragent@cursor.com>

* style(api): format validation routing test

Apply Ruff formatting to keep the Edvise validation routing tests passing style checks.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(webapp): establish pyproject.toml as canonical Edvise API version (#243)

* feat(webapp): establish pyproject.toml as canonical Edvise API version, set version and title in OpenAPI

* docs: rename SST -> Edvise

* docs(webapp): update formatter instructions to ruff to align with github workflows and engineering playbook

* test(webapp): assert OpenAPI version matches pyproject.toml

* feat(eda): add clear_cache option to /eda endpoint (#233)

* feat: legacy school inference DB job trigger (#212)

* feat: custom school inference, but need to confirm if custom is the same as legacy

* fix: transitioning from 'custom' to 'legacy'

* fix: remove validation of job parameters, handled already through edvise

* fix: run request still requires str values, defaulting to empty string

* fix: still getting pydantic error

* feat: using substring matching to find legacy job since I have it deployed under my name because of target==dev

* fix: style

* fix: style

* fix: style

* fix: making batch file name more robust so we don't run into decoding issues

* fix: merge conflict

* fix: merge conflict

---------

Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>

* feat: add GenAI as 4th option in addition to PDP / Edvise / Legacy in API and uploads (#244)

* feat: Added "GenAI" as an option for "create institution"

note: for GenAI raw files, we will reuse the same loose rules as Legacy institutions (read CSV, PII check, no strict ES columns).

* fix: style

---------

Co-authored-by: Noreen Mayat <nm3224@alum.barnard.edu>
Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>

* fix(models): derive PDP batch schema configs from institution schemas (#247)

* fix(models): derive PDP batch schema configs from institution schemas

When model.schema_configs is null, PDP inference now builds a default
required COURSE+STUDENT (etc.) batch rule from inst.schemas instead of
returning a 500. Explicit model configs still take precedence.
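
The fallback might look roughly like the sketch below. The function name `resolve_schema_configs` and the shape of each config entry (`schema_type`, `optional`) are assumptions made for illustration; the real models may represent batch rules differently.

```python
from typing import Optional


def resolve_schema_configs(
    model_schema_configs: Optional[list[dict]],
    inst_schemas: list[str],
) -> list[dict]:
    """Pick the batch schema rule for an inference run (hypothetical helper)."""
    # Explicit model configs take precedence, as stated in the commit.
    if model_schema_configs is not None:
        return model_schema_configs

    if not inst_schemas:
        raise ValueError("institution has no schemas to derive a batch rule from")

    # Default rule: every institution schema is required in the batch.
    # The entry shape here is an assumption, not the repository's format.
    return [
        {"schema_type": name, "optional": False}
        for name in sorted(set(inst_schemas))
    ]
```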

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(models): cast jsonpickle decode for mypy no-any-return

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(databricks): prefer Cloud Run job when pipeline name is ambiguous

When multiple dev bundle jobs match a PDP or legacy inference pipeline
substring, resolve to [dev dev_cloudrun_sa] if present; otherwise use
the first sorted match instead of failing the inference request.
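
A minimal sketch of that tie-break, assuming job names are plain strings and that the preferred job is recognized by the `[dev dev_cloudrun_sa]` marker appearing in its name. `pick_job_name` is a hypothetical helper, not the function in the repository.

```python
PREFERRED_MARKER = "[dev dev_cloudrun_sa]"  # marker quoted in the commit message


def pick_job_name(job_names: list[str], pipeline_substring: str) -> str:
    """Choose one job when several names contain the pipeline substring."""
    matches = sorted(name for name in job_names if pipeline_substring in name)
    if not matches:
        raise ValueError(f"no job name contains {pipeline_substring!r}")

    # Prefer the Cloud Run service account's deployment when it exists.
    for name in matches:
        if PREFERRED_MARKER in name:
            return name

    # Otherwise fall back to the first sorted match instead of failing.
    return matches[0]
```

Sorting before the fallback keeps the choice deterministic across requests, whatever order the listing returns.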

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

* ci: datakind shared workflows (#245)

* ci: datakind shared workflows

* refactor: rename test.yml -> tests.yml

* fix(ci): add workflow_call to style and tests workflows

* refactor: use pre-release workflow from shared workflows

* ci: replace with shared enforce-pr-targets workflow

Aligns checks against the current protected branches, main and develop, rather than staging

* chore: remove unused workflow

* refactor: remove pull_request triggers. These run via ci.yml

* ci: pin tests and type-check to Python 3.13

* chore(ci): remove legacy webapp-and-worker precommit workflow

* ci: standardize on Python 3.12 across workflows and pyproject

* ci: test workflow enforcement

* ci: test workflow enforcement

* ci: add gate job to report required ci status check

* chore: bump python version to 3.10

* chore: standardize Python 3.12 across project and Docker

* chore: updating edvise v1.2.0

* chore: CHANGELOG.md update + type check

* chore(release): bump version

* ci(cloudbuild): parameterize webapp deploy for multi-environment triggers

---------

Co-authored-by: Rachel Wells <rachellaurynwells@gmail.com>
Co-authored-by: Vishakh Pillai <64162993+vishpillai123@users.noreply.github.com>
Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>
Co-authored-by: Hannah Ofstedahl <98632391+chapmanhk@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Noreen Mayat <nm3224@alum.barnard.edu>

---------

Co-authored-by: Meshach Ogunmodede <142531479+Mesh-ach@users.noreply.github.com>
Co-authored-by: Hannah Ofstedahl <98632391+chapmanhk@users.noreply.github.com>
Co-authored-by: Hannah Ofstedahl <hannahxchapman@yahoo.com>
Co-authored-by: Mesh <meshach.ogunmodede@datakind.org>
Co-authored-by: William Carr <bill.carr@datakind.org>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: William Carr <bill@datakind.org>
Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>
Co-authored-by: kaylawilding <95330483+kaylawilding@users.noreply.github.com>
Co-authored-by: Rachel Wells <rachellaurynwells@gmail.com>
Co-authored-by: Noreen Mayat <nm3224@alum.barnard.edu>

* refactor(validation): remove API JSON schema validation from upload pipeline (#246)

* refactor(validation): remove API JSON schema validation

Route upload validation through institution namespaces and the edvise repo Pandera schemas instead of API-local JSON schema documents.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(validation): update upload validation coverage

Cover repo-backed PDP and Edvise upload validation, legacy handling, and unsupported model errors after removing the JSON fallback.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(data): repair GenAI upload validation and enable bronze sync

Correct a merge regression where legacy/GenAI institutions returned a tuple
from _resolve_schema_namespace, include GenAI in GCS→bronze sync, and add
upload validation test coverage for GenAI schools.

Co-authored-by: Cursor <cursoragent@cursor.com>

* ci: temporarily use new asana shared workflow

* ci: use shared asana task link from @main branch

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Vishakh Pillai <64162993+vishpillai123@users.noreply.github.com>
Co-authored-by: William Carr <bill@datakind.org>

* feat: hook up Edvise Schema (ES) inference to api (#253)

* feat: hook up Edvise Schema (ES) inference to api, so we can run it from webapp like PDP; and enhance institution type handling

- Updated `trigger_inference_run` to include support for Edvise Schema (ES) and GenAI alongside existing PDP and Legacy types.
- Enhanced mutual exclusivity check to include `genai_id`.
- Introduced `run_es_inference` method in `DatabricksControl` for triggering ES inference jobs.
- Updated error messages and validation checks to reflect the new schema options.
- Added tests to ensure proper handling of Edvise institutions and inference logic.
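
One way to express the mutual exclusivity check is sketched here. The four id field names appear in this PR history; the helper `institution_type` is hypothetical, and the sketch assumes exactly one id must be set, which the commit message does not spell out.

```python
from typing import Optional

ID_FIELDS = ("pdp_id", "edvise_id", "legacy_id", "genai_id")


def institution_type(**ids: Optional[str]) -> str:
    """Return 'pdp', 'edvise', 'legacy', or 'genai' for an institution.

    Assumes exactly one of the id fields is set (an assumption of this sketch).
    """
    set_fields = [field for field in ID_FIELDS if ids.get(field)]
    if len(set_fields) != 1:
        raise ValueError(
            f"expected exactly one of {ID_FIELDS} to be set, got {set_fields}"
        )
    return set_fields[0].removesuffix("_id")
```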

* fix: style

* feat: renaming "legacy_model_result" to just "model_result" to encompass both legacy and edvise school results

* feat: renaming from "DatabricksLegacyInferenceRunRequest" to "DatabricksSharedInferenceRunRequest"

also renamed model_result to shared_model_result for consistency

* fix: removing part of comment that's irrelevant

* fix: lint

* fix: removing genai from error message

* feat: creating `is_genai_institution` parameter to feed into genai/edvise inference job for SSoT (#254)

* feat: using `batch_id` parameter for subfolder naming convention during GCS to DB bronze async job (#256)

* feat: use batch parameters for run-inference endpoint for genAI/Edvise/Legacy schools (#257)

* docs(db): add staging-verified shared schema contract for UI and API (#258)

* docs: add shared database schema contract for UI and API tables

Publish canonical ownership and column definitions for users and job
plus inventory of UI-only and API-only tables to support Phase 0
migration split work.

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs: verify schema contract against staging all_tables DDL

Align users and job canonical columns with staging SHOW CREATE TABLE
exports (2026-06-24). Document FK on users.inst_id, skip job ALTER on
staging, and exclude backup tables from Alembic scope.

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs: document shared users and job tables in API README (#259)

Link DB_SCHEMA_CONTRACT.md and document migration ownership plus
greenfield bootstrap order for the shared Cloud SQL database.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(api): expose model_run_id and model_version on RunInfo endpoints (#260)

* docs: add shared database schema contract for UI and API tables

Publish canonical ownership and column definitions for users and job
plus inventory of UI-only and API-only tables to support Phase 0
migration split work.

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs: verify schema contract against staging all_tables DDL

Align users and job canonical columns with staging SHOW CREATE TABLE
exports (2026-06-24). Document FK on users.inst_id, skip job ALTER on
staging, and exclude backup tables from Alembic scope.

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs: document shared users and job tables in API README

Link DB_SCHEMA_CONTRACT.md and document migration ownership plus
greenfield bootstrap order for the shared Cloud SQL database.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(api): expose model_run_id and model_version on RunInfo endpoints

Include training run identifiers on list-runs and single-run responses so
the UI can stop falling back to direct job table reads.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(api): mirror accepted_terms and invite_validated on AccountTable (#263)

Keep users ORM aligned with Laravel migrations for shared-table contract
compliance; no DDL change (columns owned by edvise-ui).

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix: sync ES pipeline rename & allow greater flexibility with legacy/GenAI/ES uploads (#262)

* fix: rename of es pipeline

* fix: schema fallback for genai & legacy institutions

* fix: allow for any non-empty batch regardless of per-file schema tags

* fix: adding a few PII false positive patterns

* fix: make PII more flexible and stop with the false positives

---------

Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>

* fix(api): coerce Databricks model version to str for RunInfo responses (#264)

PR #260 added model_version as a string on RunInfo, but Databricks returns
version as int, causing ResponseValidationError on run-inference for all school types.
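
The coercion can be illustrated with a small standard-library sketch. `RunInfoSketch` and `to_run_info` are hypothetical stand-ins for the real response model, which is not shown here; only the `model_run_id` and `model_version` field names come from the PR history.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class RunInfoSketch:
    """Stand-in for the API response model; both fields are strings."""

    model_run_id: Optional[str]
    model_version: Optional[str]


def to_run_info(
    model_run_id: Optional[str],
    raw_version: Union[int, str, None],
) -> RunInfoSketch:
    # Databricks reports the version as an int; convert it up front so the
    # string-typed response field never receives an int.
    version = None if raw_version is None else str(raw_version)
    return RunInfoSketch(model_run_id=model_run_id, model_version=version)
```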

Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(release): bump version

* chore(release): update CHANGELOG

* chore: bump edvise from 1.2.0 to 1.4.0

---------

Co-authored-by: Rachel Wells <rachellaurynwells@gmail.com>
Co-authored-by: Vishakh Pillai <64162993+vishpillai123@users.noreply.github.com>
Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>
Co-authored-by: Hannah Ofstedahl <98632391+chapmanhk@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Noreen Mayat <nm3224@alum.barnard.edu>
Co-authored-by: Meshach Ogunmodede <142531479+Mesh-ach@users.noreply.github.com>
Co-authored-by: Hannah Ofstedahl <hannahxchapman@yahoo.com>
Co-authored-by: Mesh <meshach.ogunmodede@datakind.org>
Co-authored-by: William Carr <bill.carr@datakind.org>
Co-authored-by: kaylawilding <95330483+kaylawilding@users.noreply.github.com>